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Abstract 

The  Conjugate  Gradient  Method  is  the  most  prominent  iterative  method  for  solving  sparse  systems  of  linear  equations. 
Unfortunately,  many  textbook  treatments  of  the  topic  are  written  so  that  even  their  own  authors  would  be  mystified, 
if  they  bothered  to  read  their  own  writing.  For  this  reason,  an  understanding  of  the  method  has  been  reserved  for  the 
elite  brilliant  few  who  have  painstakingly  decoded  the  mumblings  of  their  forebears.  Nevertheless,  the  Conjugate 
Gradient  Method  is  a  composite  of  simple,  elegant  ideas  that  even  stupid  people  can  understand.  Of  course,  a  reader 
as  intelligent  as  yourself  will  learn  them  almost  effortlessly. 

The  idea  of  quadratic  forms  is  introduced  and  used  to  derive  the  methods  of  Steepest  Descent,  Conjugate  Directions, 
and  Conjugate  Gradients.  Eigenvectors  are  explained  and  used  to  examine  the  convergence  of  the  Jacobi  Method. 
Steepest  Descent,  and  Conjugate  Gradients.  Other  topics  include  preconditioning  and  the  nonlinear  Conjugate  Gradient 
Method.  I  have  taken  pains  to  make  this  article  easy  to  read.  Sixty-two  illustrations  are  provided.  Dense  prose  is 
avoided.  Concepts  are  explained  in  several  different  ways.  Most  equations  are  coupled  with  an  intuitive  interpretation. 
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1.  Introduction 


When  I  decided  to  learn  the  Conjugate  Gradient  Method  (henceforth,  CG),  I  read  four  different  descriptions, 
which  I  shall  politely  not  identify.  I  understood  none  of  them.  By  the  end  of  the  last,  I  swore  in  my  rage 
that  if  ever  I  unlocked  the  secrets  of  CG,  I  should  guard  them  as  jealously  as  my  intellectual  ancestors. 
Foolishly,  I  wrote  this  article  instead. 

CG  is  the  most  popular  iterative  method  for  solving  large  systems  of  linear  equations.  CG  is  effective 
for  systems  of  the  form 

Ax  =  b  ( 1 ) 

where  x  is  an  unknown  vector,  6  is  a  known  vector,  and  /I  is  a  known,  square,  symmetric,  positive-definite 
(or  positive-indefinite)  matrix.  (Don’t  worry  if  you’ve  forgotten  what  “positive-definite”  means;  we  shall 
review  it.)  These  systems  arise  in  many  important  settings,  such  as  finite  difference  and  finite  element 
methods  for  solving  partial  differential  equations,  structural  analysis,  circuit  analysis,  and  math  homework. 

Iterative  methods  like  CG  are  suited  for  use  with  sparse  matrices.  If  A  is  dense,  your  best  course  of 
action  is  probably  to  factor  A  and  solve  the  equation  by  backsubstitution.  The  time  spent  factoring  a  dense 
A  is  roughly  equivalent  to  the  time  spent  solving  the  system  iteratively;  and  once  .4  is  factored,  the  system 
can  be  backsolved  quickly  for  multiple  values  of  b.  Compare  this  dense  matrix  with  a  sparse  matrix  of 
larger  size  that  fills  the  same  amount  of  memory.  The  triangular  factors  of  a  sparse  .4  usually  have  many 
more  nonzero  elements  than  /I  itself.  Factoring  may  be  impossible  due  to  limited  memory,  and  will  be 
time-consuming  as  well;  even  the  backsolving  step  may  be  slower  than  iterative  solution.  On  the  other 
hand,  most  iterative  methods  are  memory-efficient  and  run  quickly  with  sparse  matrices. 

I  assume  that  you  have  taken  a  first  course  in  linear  algebra,  and  that  you  have  a  solid  understanding 
of  matrix  multiplication  and  linear  independence,  although  you  probably  don’t  remember  what  those 
eigenthingies  were  all  about.  From  this  foundation,  I  shall  build  the  edifice  of  CG  as  clearly  as  I  can. 


2.  Notation 


Before  we  begin,  a  few  definitions  and  notes  on  notation  are  in  order. 

With  a  few  exceptions,  I  shall  use  capital  letters  to  denote  matrices,  lower  case  letters  to  denote  vectors, 
and  Greek  letters  to  denote  scalars.  ,4  is  an  n  x  n  matrix,  and  x  and  b  are  vectors  —  that  is,  n  x  1  matrices. 
Equation  1,  written  out  fully,  looks  like  this: 

.4u  4)2  ...  A\n 

.421  Ml  Mn 

•“In  I  ■  .  .  4nn 
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The  inner  product  of  two  vectors  is  written  xTy,  and  represents  the  scalar  sum  £"=1  x>y>-  Note  that 
xTy  =  yTx.  If  x  and  y  are  orthogonal,  then  xTy  =  0.  In  general,  expressions  that  reduce  to  1  x  1  matrices, 
such  as  xTy  and  xT  Ax,  are  treated  as  scalar  values. 


Figure  1:  Sample  two-dimensional  linear  system.  The  solution  lies  at  the  intersection  of  the  lines. 

A  matrix  A  is  positive-definite  if,  for  every  nonzero  vector  x, 

xT Ax  >  0.  (2) 

This  may  mean  little  to  you,  but  don’t  feel  bad;  it’s  not  a  very  intuitive  idea,  and  it’s  hard  to  imagine  how 
a  matrix  that  is  positive-definite  might  look  differently  from  one  that  isn’t.  We  will  get  a  feeling  for  what 
positive-detiniteness  is  about  when  we  see  how  it  affects  the  shape  of  quadratic  forms. 

Finally,  don’t  forget  the  important  basic  identities  ( AB )T  =  BTAT  and  (.45)” 1  =  B~'  .4-1 . 


3.  The  Quadratic  Form 


A  quadratic  form  is  simply  a  scalar,  quadratic  function  of  a  vector  with  the  form 


fix) 


.4a:  -  bTx  -f  c 


(3) 


where  A  is  a  matrix,  x  and  6  are  vectors,  and  c  is  a  scalar  constant.  I  shall  show  shortly  that  if  .4  is  symmetric 
and  positive-definite,  /( x )  is  minimized  by  the  solution  to  Ax  =  b. 

Throughout  this  paper,  I  will  demonstrate  ideas  with  the  simple  sample  problem 


A  = 


3 

2 


c  =  0. 


(4) 


The  system  Ax  =  6  is  illustrated  in  Figure  1 .  In  general,  the  solution  x  lies  at  the  intersection  point  of  n 
hyperplanes,  each  having  dimension  n  —  1.  The  solution  in  this  case  is  x  =  [2,  -2]T.  The  corresponding 
quadratic  form  f(x)  appears  in  Figure  2.  A  contour  plot  of  /(x)  is  illustrated  in  Figure  3.  Because  .4  is 
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Figure  4:  Gradient  /'(*)  of  the  quadratic  form.  For  every  *,  the  gradient  points  in  the  direction  of  steepest 
increase  of  /(*).  and  is  orthogonal  to  the  contour  lines. 


positive-definite,  the  surface  defined  by  f(x)  is  shaped  like  a  paraboloid  bowl.  (I’ll  have  more  to  say  about 
this  in  a  moment.) 


The  gradient  of  a  quadratic  form  is  defined  to  be 

r  */(■>  l 

_3-  f(x) 

/'(*)=  812  .  •  (5) 

-  &/(*)  - 

The  gradient  is  a  vector  field  that,  for  a  given  point  x,  points  in  the  direction  of  greatest  increase  of  f(x). 
Figure  4  illustrates  the  gradient  vectors  for  Equation  3  with  the  constants  given  in  Equation  4.  At  the  bottom 
of  the  paraboloid  bowl,  the  gradient  is  zero.  One  can  minimize  f(x)  by  setting  f'(x)  equal  to  zero. 

With  a  little  bit  of  tedious  math,  one  can  apply  Equation  5  to  Equation  3,  and  derive 

/'(*)  =  ^Atx  +  ]-Ax  b.  (6) 

If  A  is  symmetric,  this  equation  reduces  to 

f\x)  =  Ax-b.  (7) 


Setting  the  gradient  to  zero,  we  obtain  Equation  1 ,  the  linear  system  we  wish  to  solve.  Therefore,  the 
solution  to  Ax  -  b  is  a  critical  point  of  /(*).  If  A  is  positive-definite  as  well  as  symmetric,  then  this 
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Figure  5:  (a)  Quadratic  form  for  .  positive-definite  matrix,  (b)  For  a  negative-definite  matrix,  (c)  For  a 
singular  (and  positive-indefinite)  matrix.  A  line  that  runs  through  the  bottom  of  the  valley  is  the  set  of 
solutions,  (d)  For  an  indefinite  matrix.  Because  the  solution  is  a  saddle  point.  Steepest  Descent  and  CG 
will  not  work.  In  three  dimensions  or  higher,  a  singular  matrix  can  also  have  a  saddle. 

solution  is  a  minimum  of  f(x),  so  Ax  =  b  can  be  solved  by  finding  an  x  that  minimizes  f(x).  (If  4  is  not 
symmetric,  then  Equation  6  hints  that  CG  will  find  a  solution  to  the  system  k(AT  +  .4) a:  =  b.  Note  that 
^(4r  +  4)  is  symmetric.) 

Why  do  symmetric  positive-definite  matrices  have  this  nice  property?  Consider  the  relationship  between 
/  at  some  arbitrary  point  p  and  at  the  solution  x  =  A~lb.  From  Equation  3  one  can  show  (Appendix  Cl) 
that  if  A  is  symmetric  (be  it  positive-definite  or  not), 

f(p)  =  f(x)  +  \(p-x)TA(p-  x).  (8) 

If  A  is  positive-definite  as  well,  then  by  Property  2,  the  latter  term  is  positive  for  all  p  ^  x.  It  follows  that 
x  is  a  global  minimum  of  /. 

The  fact  that  /(x)  is  a  paraboloid  is  our  best  intuition  of  what  it  means  for  a  matrix  to  be  positive-definite. 
If  4  is  not  positive-definite,  there  are  several  other  possibilities.  4  could  be  negative-definite  —  the  result 
of  negating  a  positive-definite  matrix  (see  Figure  2,  but  hold  it  upside-down).  4  might  be  singular,  in  which 
case  no  solution  is  unique;  the  set  of  solutions  is  a  line  or  hyperplane  having  a  uniform,  value  for  /.  If 
4  is  none  of  the  above,  then  x  is  a  saddle  point,  and  techniques  like  Steepest  Descent  and  CG  will  likely 
fail.  Figure  5  demonstrates  the  possibilities.  The  value  of  y  determines  where  the  minimum  point  of  the 
paraboloid  lies,  but  does  not  affect  the  paraboloid’s  shape. 

Why  go  to  the  trouble  of  converting  the  linear  system  into  a  tougher- looking  problem?  The  methods 
under  study  —  Steepest  Descent  and  CG  —  were  developed  and  are  intuitively  understood  in  terms  of 
minimization  problems  like  Figure  2,  not  in  terms  of  intersecting  hvperplanes  such  as  Figure  1 . 
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4.  The  Method  of  Steepest  Descent 


In  the  method  of  Steepest  Descent,  we  start  at  an  arbitrary  point  X(0)  and  slide  down  to  the  bottom  of  the 
paraboloid.  We  take  a  series  of  steps  X(i),X(2),  ■  ■  ■  until  we  are  satisfied  that  we  are  close  enough  to  the 
solution  x. 

When  we  take  a  step,  we  choose  the  direction  in  which  /  decreases  most  quickly,  which  is  the  direction 
opposite  of  /'(x(ij).  According  to  Equation  7,  this  direction  is  —  /'( X(,j )  =  b  -  Ax;,). 

Allow  me  to  introduce  a  few  definitions,  which  you  should  memorize.  The  error  =  x(,j  -  x  is  a 
vector  that  indicates  how  far  we  are  from  the  solution.  The  residual  =  6  -  Ax(t)  indicates  how  far  we 
are  from  the  correct  value  of  b.  It  is  easy  to  see  that  =  —Ae^,  and  you  should  think  of  the  residual  as 
being  the  error  transformed  by  A  into  the  same  space  as  b.  More  importantly,  =  - f'(x^)),  and  you 
should  also  think  of  the  residual  as  the  direction  of  steepest  descent.  For  nonlinear  problems,  discussed  in 
Section  14,  only  the  latter  definition  applies.  So  remember,  whenever  you  read  “residual”,  think  “direction 
of  steepest  descent.” 

Suppose  we  start  at  X(0)  =  [-2,  -2]T.  Our  first  step,  along  the  direction  of  steepest  descent,  will  fall 
somewhere  on  the  solid  line  in  Figure  6(a).  In  other  words,  we  will  choose  a  point 

x(i)  =  *(0)  +  ar(0)-  (9) 

The  question  is,  how  big  a  step  should  we  take? 

A  line  search  is  a  procedure  that  chooses  a  to  minimize  /  along  a  line.  Figure  6(b)  illustrates  this  task: 
we  are  restricted  to  choosing  a  point  on  the  intersection  of  the  vertical  plane  and  the  paraboloid.  Figure  6(c) 
is  the  parabola  defined  by  the  intersection  of  these  surfaces.  What  is  the  value  of  a  at  the  base  of  the 
parabola? 

a  minimizes  /  when  the  directional  derivative  i))  is  equal  to  zero.  By  the  chain  rule,  j~/(z(i))  = 

f(x(i))T  7%x(\)  =  f(x(  i))Tr(o)-  Setting  this  expression  to  zero,  we  find  that  a  should  be  chosen  so  that 
f(0)  and  /'(x(1))  are  orthogonal  (see  Figure  6(d)). 

There  is  an  intuitive  reason  why  we  should  expect  these  vectors  to  be  orthogonal  at  the  minimum. 
Figure  7  shows  the  gradient  vectors  at  various  points  along  the  search  line.  The  slope  of  the  parabola 
(Figure  6(c))  at  any  point  is  equal  to  the  magnitude  of  the  projection  of  the  gradient  onto  the  line  (Figure  7, 
dotted  arrows).  These  projections  represent  the  rate  of  increase  of  /  as  one  traverses  the  search  line.  /  is 
minimized  where  the  projection  is  zero  —  where  the  gradient  is  orthogonal  to  the  search  line. 


To  determine  a,  note  that  /'(x^))  =  -r^j,  and  we  have 

rU)r(0>  = 

0 

(6- Ax(,))rr(0)  = 

0 

(6-  A(x(0)  +  ar(o)))rr{0)  = 

0 

( b  -  Ax(0))rr(0)  -  a(Ar(0))rr(0)  = 

0 

(6- Ax(0))rr(0)  = 

QUr(0))rr(0) 

rfo)r(0)  = 

arS)(^r(0») 

rJo)r(o) 

a  = 

rfo)Arm 

The  Method  of  Steepest  Descent 
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Figure  6:  The  method  of  Steepest  Descent,  (a)  Starting  at  [-2.  -2]T,  take  a  step  in  the  direction  of  steepest 
descent  of  /.  (b)  Find  the  point  on  the  intersection  of  these  two  surfaces  that  minimizes  /.  (c)  This  parabola 
is  the  intersection  of  surfaces.  The  bottommost  point  is  our  target,  (d)  The  gradient  at  the  bottommost  point 
is  orthogonal  to  the  gradient  of  the  previous  step. 


Figure  7:  The  gradient  /'  is  shown  at  several  locations  along  the  search  line  (solid  arrows).  Each  gradient’s 
projection  onto  the  line  is  also  shown  (dotted  arrows).  The  gradient  vectors  represent  the  direction  of 
steepest  increase  of  /,  and  the  projections  represent  the  rate  of  increase  as  one  traverses  the  search  line. 
On  the  search  line,  /  is  minimized  where  the  gradient  is  orthogonal  to  the  search  line. 
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Figure  8:  Here,  the  method  of  Steepest  Descent  starts  at  [-2,  -2]T  and  converges  at  [2,  -2]T. 
Putting  it  all  together,  the  method  of  Steepest  Descent  is: 


r(»)  =  ft  - 

(10) 

r(0r(0 

a(0  “  rT  Ar  ’ 
r(i)A  (*) 

(11) 

*(i+i)  =  *(i)  +  a(»)r(0- 

(12) 

The  example  is  run  until  it  converges  in  Figure  8.  Note  that  the  gradient  is  always  orthogonal  to  the 
gradient  of  the  previous  step. 

The  algorithm,  as  written  above,  requires  two  matrix-vector  multiplications  per  iteration.  In  general,  the 
computational  cost  of  iterative  algorithms  is  dominated  by  matrix-vector  products;  fortunately,  one  can  be 
eliminated.  By  premultiplying  both  sides  of  Equation  12  by  -  A  and  adding  b,  we  have 

r(i+i)  =  r(»)  ~  Q(0^r(*)-  (,3) 

Although  Equation  10  is  needed  to  compute  r(0),  Equation  1 3  can  be  used  for  every  iteration  thereafter.  The 
product  Ar,  which  occurs  in  both  Equations  '  1  and  13,  need  only  be  computed  once.  The  disadvantage  of 
using  this  recurrence  is  that  the  sequence  defined  by  Equation  1 3  is  generated  without  any  feedback  from 
the  value  of  so  that  an  accumulation  of  floating  point  roundoff  error  may  cause  x^  to  converge  to 
some  point  near  x.  This  effect  can  be  avoided  by  periodically  using  Equation  10  to  recompute  the  correct 
residual. 

Before  analyzing  the  convergence  of  Steepest  Descent,  I  must  digress  to  ensure  that  you  have  a  solid 
understanding  of  eigenvectors. 
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5.  Thinking  with  Eigenvectors  and  Eigenvalues 

After  my  one  course  in  linear  algebra,  I  knew  eigenvectors  and  eigenvalues  like  the  back  of  my  head.  If 
your  instructor  was  anything  like  mine,  you  recall  solving  problems  involving  eigendoohickeys,  but  you 
never  really  understood  them.  Unfortunately,  without  an  intuitive  grasp  of  them,  CG  won’t  make  sense 
either.  If  you’re  already  eigentalented,  feel  free  to  skip  this  section. 

Eigenvectors  are  used  primarily  as  an  analysis  tool;  Steepest  Descent  and  CG  do  not  calculate  the  value 
of  any  eigenvectors  as  part  of  the  algorithm1 . 

5.1.  Eigen  do  it  if  I  try 

An  eigenvector  v  of  a  matrix  B  is  a  nonzero  vector  that  does  not  rotate  when  B  is  applied  to  it  (except 
perhaps  to  point  in  precisely  the  opposite  direction),  v  may  change  length  or  reverse  its  direction,  but  it 
won’t  turn  sideways.  In  other  words,  there  is  some  scalar  constant  A  such  that  Bv  =  Xv.  The  value  A  is 
an  eigenvalue  of  B.  For  any  constant  a,  the  vector  av  is  also  an  eigenvector  with  eigenvalue  A,  because 
B(av)  =  aBv  =  A  av.  In  other  words,  if  you  scale  an  eigenvector,  it’s  still  an  eigenvector. 

Why  should  you  care?  Iterative  methods  often  depend  on  applying  B  to  a  vector  over  and  over  again. 
When  B  is  repeatedly  applied  to  an  eigenvector  v,  one  of  two  things  can  happen.  If  |  A|  <  1,  then  B'v  =  X'v 
will  vanish  as  i  approaches  infinity  (Figure  9).  If  |A|  >  1,  then  B'v  will  grow  to  infinity  (Figure  10).  Each 
time  B  is  applied,  the  vector  grows  or  shrinks  according  to  the  value  of  |  A  | . 


Figure  9:  v  is  an  eigenvector  of  B  with  a  corresponding  eigenvalue  of  -0.5.  As  i  increases,  B'v  converges 
to  zero. 


Figure  10:  Here,  v  has  a  corresponding  eigenvalue  of  2.  As  i  increases,  B'v  diverges  to  infinity. 

1  However,  there  are  practical  applications  for  eigenvectors.  The  eigenvectors  of  the  stiffness  matrix  associated  with  a  discretized 
structure  of  uniform  density  represent  the  natural  modes  of  vibration  of  the  structure  being  studied.  For  instance,  the  eigenvectors 
of  the  stiffness  matrix  associated  with  a  one-dimensional  uniformly-spaced  mesh  are  sine  waves,  and  to  express  vibrations  as  a 
linear  combination  of  these  eigenvectors  is  equivalent  to  performing  a  discrete  Fourier  transform. 
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If  B  is  nonsingular,  then  there  exists  a  set  of  n  linearly  independent  eigenvectors  of  B,  denoted 
t>i ,  V2, . . . ,  vn.  This  set  is  not  unique,  because  each  eigenvector  can  be  scaled  by  an  arbitrary  nonzero 
constant.  Each  eigenvector  has  a  corresponding  eigenvalue,  denoted  A!(  A2, . . . ,  An.  These  are  uniquely 
defined  for  a  given  matrix.  The  eigenvalues  may  or  may  not  be  equal  to  each  other;  for  instance,  the 
eigenvalues  of  the  identity  matrix  /  are  all  one,  and  every  nonzero  vector  is  an  eigenvector  of  I. 

What  if  B  is  applied  to  a  vector  that  is  not  an  eigenvector?  A  very  important  skill  in  understanding 
linear  algebra  —  the  skill  this  section  is  written  to  teach  —  is  to  think  of  a  vector  as  a  sum  of  other 
vectors  whose  behavior  is  understood.  Consider  that  the  set  of  eigenvectors  { v, }  forms  a  basis  for  Rn 
(because  a  nonsingular  B  has  n  eigenvectors  that  are  linearly  independent).  Any  n-dimensional  vector  can 
be  expressed  as  a  linear  combination  of  eigenvectors,  and  because  matrix  multiplication  is  distributive,  one 
can  examine  the  effect  of  B  on  each  eigenvector  separately. 

In  Figure  11,  a  vector  x  is  illustrated  as  a  sum  of  two  eigenvectors  t>i  and  v2.  Applying  B  to  x  is 
equivalent  to  applying  B  to  the  eigenvectors,  and  summing  the  result.  Upon  repeated  application,  we  have 
B'x  =  B'v  1  +  B'v 2  =  A\ V|  +  \\vi.  If  the  magnitudes  of  all  the  eigenvalues  are  smaller  than  one,  B'x 
will  converge  to  zero,  because  the  eigenvectors  which  compose  x  converge  to  zero  when  B  is  repeatedly 
applied.  If  one  of  the  eigenvalues  has  magnitude  greater  than  one,  x  will  diverge  to  infinity.  This  is  why 
numerical  analysts  attach  importance  to  the  spectral  radius  of  a  matrix: 

p(B)  =  max  |  A, |,  A,  is  an  eigenvalue  of  B. 

If  we  want  x  to  converge  to  zero  quickly,  p(B)  should  be  less  than  one,  and  preferably  as  small  as  possible. 


Figure  1 1 :  The  vector  x  (solid  arrow)  can  be  expressed  as  a  linear  combination  of  eigenvectors  (dashed 
arrows),  whose  associated  eigenvalues  are  A,  =  0.7  and  \2  =  -2.  The  effect  of  repeatedly  applying  B  to 
x  is  best  understood  by  examining  the  effect  of  B  on  each  eigenvector.  When  B  is  repeatedly  applied,  one 
eigenvector  converges  to  zero  while  the  other  diverges;  hence,  B'x  also  diverges. 

Here’s  a  useful  fact:  the  eigenvalues  of  a  positive-definite  matrix  are  all  positive.  This  fact  can  be  proven 
from  the  definition  of  eigenvalue: 


Bv  —  \v 
vT  Bv  =  \vTv. 

By  the  definition  of  positive-definite,  the  left-hand  term  is  positive  (for  nonzero  v).  Hence,  A  must  be 
positive  also. 


5.2.  Jacobi  iterations 

Of  course,  a  procedure  that  always  converges  to  zero  isn’t  going  to  help  you  attract  friends.  Consider  a  more 
useful  procedure:  the  Jacobi  Method  for  solving  Ax  =  b.  The  matrix  A  is  split  into  two  parts:  D,  whose 
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diagonal  elements  are  identical  to  those  of  A,  and  whose  off-diagonal  elements  are  zero;  and  E,  whose 
diagonal  elements  are  zero,  and  whose  off-diagonal  elements  are  identical  to  those  of  A.  Thus,  A  =  D  +  E. 
We  derive  the  Jacobi  Method: 

Ax  =  y 
Dx  =  -Ex  +  y 
x  =  -D~lEx  +  D~'b 

x  =  Bx  +  z,  where  B  =  -D~lE,  z  =  D~lb.  (14) 

Note  that  because  D  is  diagonal,  it  is  easy  to  invert.  This  identity  can  be  converted  into  an  iterative  method 
by  forming  the  recurrence 

x{i+\)  =  Bx{i)  +  z.  (15) 

Given  a  starting  vector  i(o)>  this  formula  generates  a  sequence  of  vectors.  Our  hope  is  that  each  successive 
vector  will  be  closer  to  the  solution  x  than  the  last,  x  is  called  a  stationary  point  of  Equation  1 5,  because  if 
Z(i)  =  x,  then  Z(i+i)  will  also  equal  x. 

Now,  this  derivation  may  seem  quite  arbitrary  to  you,  and  you’re  right.  We  could  have  formed  any 
number  of  identities  for  x  instead  of  Equation  14.  In  fact,  simply  by  splitting  A  differently  —  that  is, 
by  choosing  a  different  D  and  E  —  we  could  have  derived  the  GauB-Seidel  method,  or  the  method  of 
Successive  Over-Relaxation  (SOR).  Our  hope  is  that  we  have  chosen  a  splitting  for  which  B  has  a  small 
spectral  radius.  I  chose  the  Jacobi  splitting  arbitrarily  for  simplicity. 

Suppose  we  start  with  some  arbitrary  vector  *(0)-  For  each  iteration,  we  apply  B  to  this  vector,  then  add 
z  to  the  result.  What  does  each  iteration  do? 

Again,  apply  the  principle  of  thinking  of  a  vector  as  a  sum  of  other,  well-understood  vectors.  Express 
each  iterate  Z(,)  as  the  sum  of  the  exact  solution  x  and  the  error  term  e^y  Then,  Equation  15  becomes 

*(i+i)  =  Bxm  +  z 

=  B(x  +  e{i))  +  z 
=  Bz  +  z  +  Betf 

=  x  +  jBe^)  (by  Equation  14), 

C(»+i)  =  Be^y  (16) 

Each  iteration  does  not  affect  the  “correct  part”  of  x ^  (because  x  is  a  stationary  point);  but  each  iteration 
does  affect  the  error  term.  It  is  apparent  from  Equation  16  that  if  p(B)  <  1,  then  the  error  term  will 
converge  to  zero  as  *  approaches  infinity.  Hence,  the  initial  vector  Z(0)  has  no  effect  on  the  inevitable 
outcome! 

Of  course,  the  choice  of  Z(0)  does  affect  the  number  of  iterations  required  to  converge  to  x  within 
a  given  tolerance.  However,  its  effect  is  less  important  than  that  of  the  spectral  radius  p(B),  which 
determines  the  speed  of  convergence.  Suppose  that  vj  is  the  eigenvector  of  B  with  the  largest  eigenvalue 
(so  that  p(B)  =  A j).  If  the  initial  error  e(0),  expressed  as  a  linear  combination  of  eigenvectors,  includes  a 
component  in  the  direction  of  Vj,  this  component  will  be  the  slowest  to  conveige.  The  convergence  of  the 
Jacobi  Method  can  be  described  as  follows: 

H)\\  <  W*)fll«wll 

where  |je||  =  (eTe)1^2  is  the  Euclidean  norm  (length)  of  e.  (In  fact,  the  inequality  holds  for  any  norm.) 
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Figure  12:  The  eigenvectors  of  A  are  directed  along  the  axes  of  the  paraboloid  defined  by  the  quadratic 
form  f(x).  Each  eigenvector  is  labeled  with  its  associated  eigenvalue.  Each  eigenvalue  is  proportional  to 
the  steepness  of  the  corresponding  slope. 

The  convergence  of  the  Jacobi  Method  depends  on  p(B),  which  depends  on  A.  Unfortunately,  Jacobi 
does  not  converge  for  every  A,  or  even  for  every  positive-definite  A. 


S3.  A  Concrete  Example 

To  demonstrate  these  ideas,  I  shall  solve  the  system  specified  by  Equation  4.  First,  we  need  a  method  of 
finding  eigenvalues  and  eigenvectors.  By  definition,  for  any  eigenvector  v  with  eigenvalue  A, 

Av  =  Xv  =  XI  v 

(XI  -  A)v  =  0. 

Eigenvectors  are  nonzero,  so  XI  -  A  must  be  singular.  Then, 

det(A/  -  A)  =  0. 


The  determinant  of  XI  -  A  is  called  the  characteristic  polynomial.  It  is  an  n-degree  polynomial  in  A 
whose  roots  are  the  set  of  eigenvalues.  The  characteristic  polynomial  of  A  (from  Equation  4)  is 


det 


A  -  3  -2 

-2  A  -  C 


=  A2  -  9A  +  14  =  (A  -  7)(A  -  2), 


and  the  eigenvalues  are  7  and  2.  To  find  the  eigenvector  associated  with  A  =  7, 


(A  I-A)v  = 


4  -2  ' 

t?i 

-2  1 

V2 

=  0 

\  4vj  —  2v2  —  0. 
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Any  solution  to  this  equation  is  an  eigenvector,  say,  v  =  [1 , 2]7.  By  the  same  method,  find  that  [-2,  l]7 
is  an  eigenvector  corresponding  to  the  eigenvalue  2.  In  Figure  12,  we  see  that  these  eigenvectors  coincide 
with  the  axes  of  the  familiar  ellipsoid,  and  that  a  larger  eigenvalue  corresponds  to  a  steeper  slope.  (Negative 
eigenvalues  indicate  that  /  decreases  along  the  axis,  as  in  Figures  5(b)  and  5(d).) 


Now,  let’s  see  the  Jacobi  Method  in  action.  Using  the  constants  specified  by  Equation  4,  we  have 


*(•+») 


0 

0  2 

1 

6  . 

2 

2  0 

*(»>  + 

■ 

'  2  ‘ 

3 

0 

*(0  + 

“5 

5  O' 

2  1 

L  0  sj 

00 

1 

The  eigenvectors  of  B  are  [\/2, 1]T  with  eigenvalue  -\/2/3,  and  [-\/2.  ljr  with  eigenvalue  v/2/3. 
These  are  graphed  in  Figure  13(a);  note  that  they  do  not  coincide  with  the  eigenvectors  of  A,  and  are  not 
related  to  the  axes  of  the  paraboloid. 

Figure  13(b)  shows  the  convergence  of  the  Jacobi  method.  The  mysterious  path  the  algorithm  takes  can 
be  understood  by  watching  the  eigenvector  components  of  each  successive  error  term  (Figures  13(c),  (d), 
and  (e)).  Figure  13(f)  plots  the  eigenvector  components  as  arrowheads.  These  are  converging  normally  at 
the  rate  defined  by  their  eigenvalues,  as  in  Figure  11. 

I  hope  that  this  section  has  convinced  you  that  eigenvectors  are  useful  tools,  and  not  just  bizarre  torture 
devices  inflicted  upon  you  by  your  professors  for  the  pleasure  of  watching  you  suffer  (although  the  latter  is 
a  nice  fringe  benefit). 


6.  Convergence  Analysis  of  Steepest  Descent 


6.1.  Instant  Results 


To  understand  the  convergence  of  Steepest  Descent,  let’s  first  consider  the  case  where  is  an  eigenvector 
with  eigenvalue  Ae.  Then,  the  residual  =  -Ae^)  =  is  also  an  eigenvector.  Equation  12  gives 


e(i+t)  =  e(i)  + 


r(V(-> 

r(Vr(0 


r«) 


r(V(<) ,  .  . 

e(«)  "^  \  „r  (  ^fe(»)) 


MV(*> 


=  0. 


Figure  14  demonstrates  why  it  takes  only  one  step  to  conveige  to  the  exact  solution.  The  point  xj,)  lies 
on  one  of  the  axes  of  the  ellipsoid,  and  so  the  residual  points  directly  to  the  center  of  the  ellipsoid.  Choosing 
<*(,)  =  A“*  gives  us  instant  convergence. 

For  a  more  general  analysis,  we  must  express  e ^  as  a  linear  combination  of  eigenvectors,  and  we 
shall  furthermore  require  these  eigenvectors  to  be  orthonormal.  It  is  proven  in  Appendix  C2  that  if  .4 
is  nonsinguiar  and  symmetric,  there  exists  a  set  of  n  orthogonal  eigenvectors  of  A.  As  we  can  scale 
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Figure  1 3:  Convergence  of  the  Jacobi  Method,  (a)  The  eigenvectors  of  B  are  shown  with  their  corresponding 
eigenvalues.  Unlike  the  eigenvectors  of  A,  these  eigenvectors  are  not  the  axes  of  the  paraboloid,  (b)  The 
Jacobi  Method  starts  at  [-2,  -2}T  and  converges  at  [2,  -2]T.  (c,  d,  e)  The  error  vectors  e(0),  e( , ,,  e(2)  (solid 
arrows)  and  their  eigenvector  components  (dashed  arrows),  (f)  Arrowheads  represent  the  eigenvector 
components  of  the  first  four  error  vectors.  Each  eigenvector  component  of  the  error  is  converging  to  zero 
at  the  expected  rate  based  upon  its  eigenvalue. 
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Figure  14:  Steepest  Descent  converges  to  the  exact  solution  on  the  first  iteration  if  the  error  term  is  an 
eigenvector. 


eigenvectors  arbitrarily,  let  us  choose  so  that  each  eigenvector  is  of  unit  length.  This  choice  gives  us  the 
useful  property  that 


t  r  !•»  j — k, 
^fe  =  \0, 


17) 


18) 


Express  the  error  term  as  a  linear  combination  of  eigenvectors 

n 

e(«)  =  E  Z]vi' 

j= i 

where  £j  is  the  length  of  each  component  of  «(,) .  From  Equations  1 7  and  1 8  we  have  the  following  identities: 

r(i)  =  ~-4e(i)  =  ~E^Xivi'  (19) 

J 

ik(i)ll2  =  <£)«*)  =  ESl  (2°) 

i 

e(i)Aed)  =  Efct'J'XE&Vi) 

J  } 

=  (21) 
] 

!lr(i)l|2  —  rfi)r(i)  «  £{?*?,  <22) 

J 

rJi)Ar{i)  =  (23) 

j 

Equation  19  shows  that  too  can  be  expressed  as  a  sum  of  eigenvector  components,  and  the  length  of 
these  components  are  \j.  Equations  20  and  22  ate  just  Pythagoras’  Law. 
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Figure  15:  Steepest  Descent  converges  to  the  exact  solution  on  the  first  iteration  if  the  eigenvalues  are  all 
equal. 

Now  we  can  proceed  with  the  analysis.  Equation  12  gives 

,  r(V(0 

e('»>  =  e(>>  +  rT  Ar',"/(,) 
r(«)-rtr(0 


=  e(i)  + 


MM 

MM 


r(.) 


(24) 


We  saw  in  the  last  example  that,  if  e^)  has  only  one  eigenvector  component,  then  convergence  is 
achieved  in  one  step  by  choosing  aj,-)  =  A"1.  Now  let’s  examine  the  case  where  e^)  is  arbitrary,  but  all  the 
eigenvectors  have  a  common  eigenvalue  A.  Equation  24  becomes 


e(«+D 


A2  T  E2 

=  e(i )  + 

=  0 


Figure  15  demonstrates  why,  once  again,  there  is  instant  convergence.  Because  all  the  eigenvalues  are 
equal,  the  ellipsoid  is  spherical;  hence,  no  matter  what  point  we  start  at,  the  residual  must  point  to  the  center 
of  the  sphere.  As  before,  choose  Q(,)  =  A-1. 

However,  if  there  are  several  unequal,  nonzero  eigenvalues,  then  no  choice  of  a(,)  will  eliminate  all  the 
eigenvector  components,  and  our  choice  becomes  a  sort  of  compromise.  In  fact,  the  fraction  in  Equation  24 
is  best  thought  of  as  a  weighted  average  of  the  values  of  A  J 1 .  The  weights  E2  ensure  that  longer  components 
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Figure  16:  The  energy  norm  of  these  two  vectors  is  equal. 

of  e(j)  are  given  precedence.  As  a  result,  on  any  given  iteration,  some  of  the  shorter  components  of  «?(l) 
might  actually  increase  in  length  (though  never  for  long).  For  this  reason,  the  methods  of  Steepest  Descent 
and  Conjugate  Gradients  are  called  roughers.  By  contrast,  the  Jacobi  Method  is  a  smoother,  because  every 
eigenvector  component  is  reduced  on  every  iteration.  Steepest  Descent  and  Conjugate  Gradients  are  not 
smoothers,  although  they  are  often  erroneously  identified  as  such  in  the  mathematical  literature. 

62.  General  Convergence 

To  bound  the  convergence  of  Steepest  Descent  in  the  general  case,  we  shall  define  the  energy  norm 
IMU  =  (eTAe)x/2  (see  Figure  16).  This  norm  is  easier  to  work  with  than  the  Euclidean  norm  we  used  to 
analyze  the  Jacobi  Method.  It  is  in  some  sense  a  more  natural  norm,  because  a  bound  on  the  energy  norm 
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of  the  error  term  corresponds  to  a  bound  on  the  optimality  of  f(x).  With  this  norm,  we  have 


Sf:  +  I) 


^e(»+l) 


(«(i)  +  aWrJi))Me(i)  +  %)r(«))  (bY  Equation  12) 


e(i)Ae(i)  +  2a(.)r(0Ae(')  +  aUrl)Ar(') 


l!e(«) 


,  ^  rl)r(')  l  _T  ■  \  |  f  ’{>)'(') 

+  rT  irn  V  (*)r(«)j  +  I  T  Ar 

r(,)ArM  \r(,)^r(«) 


rl)rW 


(by  symmetry  of  4) 
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(«) 

pT 


r&^r(.) 


II2. 


(r(V(Q) 

r(',)Ar(>) 


V  (rl)Ar{i))(eJi)Ae(i))] 
(x  (EjCM)2  ) 
{  (Zi&jXLjfyj)/ 


(by  Identities  21,  22,  23) 


IMlV. 


u2  =  1  - 


(25) 


The  analysis  depends  upon  finding  an  upper  bound  for  u.  To  demonstrate  how  the  weights  and 
eigenvalues  affect  convergence,  I  shall  derive  a  result  for  n  -  2.  Assume  that  Ai  >  A2.  The  spectral 
condition  number  of  A  is  defined  to  be  k  =  Aj/A2  >  1.  The  slope  of  (relative  to  the  coordinate  system 
defined  by  the  eigenvectors)  is  denoted  p  -  £2/£i-  We  have 


_ (£?A2  +  £2A2)2 _ 

(^Ai  +  ?|A2)(^Af  +  ^A|) 

(«2  +  **2)2 
(k  +  p2)(k3  +  p2) 


(26) 


The  value  of  w,  which  determines  the  rate  of  convergence  of  Steepest  Descent,  is  graphed  as  a  function 
of  p  and  :v  in  Figure  17.  The  graph  confirms  my  two  examples.  If  e(0)  is  an  eigenvector,  then  the  slope  p 
is  zero  (or  infinite);  we  see  from  the  graph  that  u>  is  zero,  so  convergence  is  instant.  If  the  eigenvalues  are 
equal,  then  the  condition  number  k  is  one;  again,  we  see  that  u  is  zero. 

Figure  18  illustrates  examples  from  near  each  of  the  four  comers  of  Figure  17.  These  quadratic  forms 
are  graphed  in  the  coordinate  system  defined  by  their  eigenvectors.  Figures  18(a)  and  18(b)  are  examples 
with  a  large  condition  number.  Steepest  Descent  can  converge  quickly  if  a  fortunate  starting  point  is  chosen 
(Figure  18(a)),  but  is  usually  at  its  worst  when  k  is  large  (Figure  18(b)).  The  latter  figure  gives  us  our  best 
intuition  for  why  a  large  condition  number  can  be  bad:  f(x )  forms  a  trough,  and  Steepest  Descent  bounces 
back  and  forth  between  the  sides  of  the  trough  while  making  little  progress  along  its  length.  In  Figures  1 8(c) 
and  1 8(d),  the  condition  number  is  small,  so  the  quadratic  form  is  nearly  spherical,  and  convergence  is  quick 
regardless  of  the  starting  point. 

Holding  k  constant  (because  A  is  fixed),  a  little  basic  calculus  reveals  that  Equation  26  is  maximized 
when  p  =  ±k.  In  Figure  17,  one  can  see  a  faint  ridge  defined  by  this  line.  Figure  19  plots  worst-case 
starting  points  for  our  sample  matrix  A.  These  starting  points  fall  on  the  lines  defined  by  £2/£t  =  ±k.  An 
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Figure  19:  Solid  lines  represent  the  starting  points  that  give  the  worst  convergence  for  Steepest  Descent. 
Dashed  lines  represent  steps  toward  convergence.  Note  that  if  the  first  iteration  starts  from  a  worst-case 
point,  so  do  all  succeeding  iterations.  Each  step  taken  intersects  the  paraboloid  axes  (grey  arrows)  at 
precisely  a  45°  angle.  Here,  k  =  3.5. 


upper  bound  for  u>  (corresponding  to  the  worst-case  starting  points)  is  found  by  setting  pr  =  k2: 

2  4k4 

W2  <  1  -  -< - - t 

K5  +  2  K4  +  K3 

__  K5  -  2k 4  +  K3 

K5  +  2k4  -I-  K3 

(K~  l)2 

(«+  l)2 


(27) 


(28) 
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Figure  20:  Convergence  of  Steepest  Descent  (per  iteration)  worsens  as  the  condition  number  of  the  matrix 
increases. 


/(*(,))  ~  /(») 

/(*(0))  ~  /(*) 


< 


je(o)/4e(o) 


(by  Equation  8) 


7.  The  Method  of  Conjugate  Directions 


7.1.  Conjugacy 

Steepest  Descent  sometimes  finds  itself  taking  steps  in  the  same  direction  as  earlier  steps  (see  Figure  8). 
Wouldn’t  it  be  better  if,  every  time  we  took  a  step,  we  got  it  right  the  first  time?  Here’s  an  idea:  let’s  pick 
a  set  of  orthogonal  search  directions  </(0),d(i),. ..  ,d(n_i).  In  each  search  direction,  we’ll  take  exactly  one 
step,  and  that  step  will  be  just  the  right  length  to  line  up  evenly  with  x.  After  n  steps,  we’ll  be  done. 

Figure  21  illustrates  this  idea,  using  the  coordinate  axes  as  search  directions.  The  first  (horizontal)  step 
leads  to  the  correct  x \ -coordinate;  the  second  (vertical)  step  will  hit  home.  Notice  that  is  orthogonal  to 
</(o).  In  general,  for  each  step  we  choose  a  point 

X(i+|)  =  *(i)  +  <*(»)<*(«)•  (29) 

To  find  the  value  of  <*(,•),  use  the  fact  that  e(,+i)  should  be  orthogonal  to  d^,  so  that  we  need  never  step  in 
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Figure  21 :  The  Method  of  Orthogonal  Directions.  Unfortunately,  this  method  only  works  if  you  already  know 
the  answer. 


the  direction  of  d(,)  again.  Using  this  condition,  we  have 


T 

d(.)e(«+D 

«)(*(.•>  +  ft(«)rf(»)) 

«(>) 


0 

0  (by  Equation  29) 

dU)eU) 


(30) 


Unfortunately,  we  haven’t  accomplished  anything,  because  we  can’t  compute  cq,)  without  knowing 
and  if  we  knew  e^,  the  problem  would  already  be  solved. 

The  solution  is  to  make  the  search  vectors  ^-orthogonal  instead  of  orthogonal.  Two  vectors  d(,j  and 
d(j)  are  A-orthogonal ,  or  conjugate,  if 

d(i)Adb)  =  °- 

Figure  22(a)  shows  what  ,4-orthogonal  vectors  look  like.  Imagine  if  this  article  were  printed  on  bubble 
gum,  and  you  grabbed  Figure  22(a)  by  the  ends  and  stretched  it  until  the  ellipses  appeared  circular.  The 
vectors  would  then  appear  orthogonal,  as  in  Figure  22(b). 


Our  new  requirement  is  that  e(l+1)  be  .4-orthogonal  to  d(t)  (see  Figure  23(a)).  Not  coincidentally,  this 
orthogonality  condition  is  equivalent  to  finding  the  minimum  point  along  the  search  direction  d (,),  as  in 
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Figure  22:  These  pairs  of  vectors  are  A -orthogonal . . .  because  these  pairs  of  vectors  are  orthogonal. 
Steepest  Descent.  To  see  this,  set  the  directional  derivative  to  zero: 

£/(*<.>■>)  =  o 

d 

/'(*(, +t))T^*(,+l)  =  0 

~rfi+l)d(i)  ~  0 
d(i)Ae(i+l)  =  * 


Following  the  derivation  of  Equation  30,  here  is  the  expression  for  0(,)  when  the  search  directions  are 
T-orthogonal: 


a(«)  =  “ 


dJi)Aed) 

dfi)Ad{i) 


(31) 


dT*)Add)' 


(32) 


Unlike  Equation  30,  we  can  calculate  this  expression.  Note  that  if  the  search  vector  were  the  residual,  this 
formula  would  be  identical  to  the  formula  used  by  Steepest  Descent. 
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Figure  23:  The  method  of  Conjugate  Directions  converges  in  n  steps,  (a)  The  first  step  is  taken  along  some 
direction  d(0).  The  minimum  point  x(i)  is  chosen  by  the  constraint  that  e( , ,  must  be  A-orthogonal  to  dt0).  (b) 
The  initial  error  e(0)  can  be  expressed  as  a  sum  of  A -orthogonal  components  (grey  arrows).  Each  step  of 
Conjugate  Directions  eliminates  one  of  these  components. 


To  prove  that  this  procedure  really  does  compute  x  in  n  steps,  express  the  error  term  as  a  linear 
combination  of  search  directions;  namely, 


n—  1 


TO) 


=  £  Ml 


(])■ 


J=0 


(33) 


The  values  of  6j  can  be  found  by 
is  possible  to  eliminate  all  the  6j 

dJk)A: 


a  mathematical  trick.  Because  the  search  directions  are  .4-orthogonal,  it 
values  but  one  from  Expression  33  by  premultiplying  the  expression  by 


dJk)Ae(0) 

T 

d(k)Ae(  0) 

6(k ) 


Yl6U)dWAd(j) 


(by  A-orthogonality  of  d  vectors) 

dJk)Ae(  Q) 

•l(k)  Ad(  fc) 

djk)A(e( Q)  +  Efjp  a(i)d(i)) 
dJk)Ad(k) 
dfk)Ae(k) 
dJk)Ad{k) 


(by  A-orthogonality  of  d  vectors) 


(By  Equation  29) 


(34) 


By  Equations  31  and  34,  we  find  that  =  -&(,).  This  fact  gives  us  a  new  way  to  look  at  the  error 
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term. 


«(,) 


i-t 

e(0)  +  Z  QU)dU) 

i=  o 

n— 1  «— 1 

Z  6U)dU)  -  Z  SU)dU) 

j=0  j= o 

n—  1 

Z  ^u)du)' 

J=> 


(35) 


Not  surprisingly.  Equation  35  shows  that  the  process  of  building  up  x  component  by  component  can  also 
be  viewed  as  a  process  of  cutting  down  the  error  term  component  by  component  (see  Figure  23(b)).  After 
n  iterations,  every  component  is  cut  away,  and  e(n)  =  0;  the  proof  is  complete. 


12.  Gram-Schmidt  Conjugation 


All  that  is  needed  now  is  a  set  of  A -orthogonal  search  directions  {d(,)}.  Fortunately,  there  is  a  simple  way 
to  generate  them,  called  a  conjugate  Gram-Schmidt  process. 


Figure  24:  Gram-Schmidt  conjugation  of  two  vectors.  Begin  with  two  linearly  independent  vectors  u0  and 
Set  d( o)  =  u0.  The  vector  u,  is  composed  of  two  components:  u*,  which  is  A-orthogonal  (or  conjugate) 
to  dt o),  and  u+,  which  is  parallel  to  d(0).  After  conjugation,  only  the  ,4-orthogonal  portion  remains,  and 

d(i)  =  u\ 

Suppose  we  have  a  set  of  n  linearly  independent  vectors  uo,  ai , . . . ,  u„_  i .  The  coordinate  axes  will 
do  in  a  pinch,  although  more  intelligent  choices  are  possible.  To  construct  </(,),  take  u,  and  subtract  out 
any  components  which  are  not  A -orthogonal  to  the  previous  d  vectors  (see  Figure  24).  In  other  words,  set 
d(o)  -  «o.  and  for  i  >  0,  set 

t-i 

d(i)  ~  u'  +  (36) 

k= 0 

where  the  /?,*  are  defined  for  i  >  k.  To  find  their  values,  use  the  same  trick  used  to  find  by. 


dJi)AdU) 

0 

ft; 


i—  I 


uf  Ad{j)  +  Yi  PikdJk)Ad{j) 
k= o 


uT  Adu)  +  &}d(j)Adu)’ 

u?Ad(j) 


i  >  j 


(by  A-orthogonality  of  d  vectors) 


(37) 
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Figure  25:  The  method  of  Conjugate  Directions  using  the  axial  unit  vectors,  also  known  as  GauBian 
elimination. 

The  difficulty  with  using  Gram-Schmidt  conjugation  in  the  method  of  Conjugate  Directions  is  that  all 
the  old  search  vectors  must  be  kept  in  memory  to  construct  each  new  one,  and  furthermore  0(n3)  operations 
are  required  to  generate  the  full  set.  In  fact,  if  the  search  vectors  are  constructed  by  conjugation  of  the  axial 
unit  vectors.  Conjugate  Directions  becomes  equivalent  to  performing  GauBian  elimination  (see  Figure  25). 
As  a  result,  the  method  of  Conjugate  Directions  enjoyed  little  use  until  the  discovery  of  CG  —  which  is  a 
method  of  Conjugate  Directions  —  cured  these  disadvantages. 

An  important  key  to  understanding  the  method  of  Conjugate  Directions  (and  also  CG)  is  to  notice 
that  Figure  25  is  just  a  stretched  copy  of  Figure  21 !  Remember  that  when  one  is  performing  the  method 
of  Conjugate  Directions  (including  CG),  one  is  simultaneously  performing  the  method  of  Orthogonal 
Directions  in  a  stretched  (scaled)  space. 


73.  Properties  of  the  Residual 

This  section  derives  several  properties  needed  to  derive  CG. 

•  As  with  the  method  of  Steepest  Descent,  the  number  of  matrix-vector  products  per  iteration  can  be 
reduced  to  one  by  using  a  recurrence  to  find  the  residual: 

r(«+i)  =  -Ae(i+l] 

=  -A(e({)  +  «(,)d(,)) 


(38) 


Figure  26:  Because  the  search  directions  d(0),d(l)  are  constructed  from  the  vectors  u0,ui,  they  span 
the  same  subspace  (the  grey-colored  plane).  The  error  term  e(2)  is  .4-orthogonal  to  this  subspace,  the 
residual  r(2)  is  orthogonal  to  this  subspace,  and  a  new  search  direction  d(2)  is  constructed  (from  u2)  to  be 
^-orthogonal  to  this  subspace.  The  endpoints  of  u2  and  d(2)  lie  on  a  plane  parallel  to  the  shaded  subspace, 
because  d(2)  is  constructed  from  u2  by  Gram-Schmidt  conjugation. 

•  This  recurrence  suggests  that,  at  the  same  time  as  the  error  term  is  being  cut  down  component  by 
component,  these  components  being  parallel  to  the  search  directions  d^,  the  residual  is  also  being 
cut  down  component  by  component,  these  components  being  parallel  to  a  different  set  of  search 
directions  Ad^)  This  suggestion  is  confirmed  by  premultiplying  Equation  35  by  -  A: 

n— l 

r(i)  =  ~H  6(k)Ad(k)- 
k=j 

•  The  inner  product  of  and  this  expression  is 

=  0,  i  <  j  (by  /1-orthogonality  of  d-vectors).  (39) 

We  could  have  derived  this  identity  by  another  tack.  Recall  that  once  we  take  a  step  in  a  search 
direction,  we  need  never  step  in  that  direction  again;  the  error  term  is  evermore  ^-orthogonal  to  all 
the  old  search  directions.  Because  r(,j  =  —Ae^),  the  residual  is  evermore  orthogonal  to  all  the  old 
search  directions. 

•  Taking  the  inner  product  of  Equation  36  and  f(;),  we  have 

i-i 

dlfU)  =  *irU)  +  IltodTk)rU)  (4°) 

fc= o 

0  =  ufr(j),  i  <  j  (by  Identity  39).  (41) 

Each  residual  is  orthogonal  to  all  the  previous  u  vectors.  This  isn’t  surprising,  because  the  search 
vectors  are  built  from  the  u  vectors;  therefore  the  search  vectors  d^, ,  d^  span  the  same  subspace 
asuo,..., Ui,  and  (for  all  j  >  i )  is  orthogonal  to  this  subspace  (see  Figure  26). 

•  There  is  one  more  identity  we  will  use  later.  From  Equation  40, 

dlm  =  u?r(iv 


(42) 
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Figure  27:  In  the  method  of  Conjugate  Gradients,  each  new  residual  is  orthogonal  to  all  the  previous 
residuals  and  search  directions;  and  each  new  search  direction  is  constructed  (from  the  residual)  to  be 
.4-orthogonal  to  all  the  previous  residuals  and  search  directions.  The  endpoints  of  r(2)  and  di2)  lie  on  a 
plane  parallel  to  the  shaded  subspace.  In  CG,  <f(2)  is  a  linear  combination  of  r(2)  and  d(l). 

8.  The  Method  of  Conjugate  Gradients 

It  may  seem  odd  that  an  article  about  CG  doesn’t  describe  CG  until  page  28,  but  all  the  machinery  is  now  in 
place.  In  fact,  CG  is  simply  the  method  of  Conjugate  Directions  where  the  search  directions  are  constructed 
by  conjugation  of  the  residuals  (that  is,  by  setting  u,  =  r^j).  There,  we’re  done. 

Following  the  residuals  worked  well  for  Steepest  Descent,  so  it  makes  sense  to  try  using  the  residuals  to 
construct  search  directions  for  CG.  Let’s  consider  the  implications.  Equation  41  becomes 

r(V(j)  =  °*  *  #  J-  (43) 

Because  each  residual  is  orthogonal  to  the  subspace  spanned  by  the  previous  search  vectors,  all  the  residuals 
must  be  mutually  orthogonal  (see  Figure  27). 

This  relation  allows  us  to  simplify  the  terms.  Let  us  calculate  the  value  of  rJ^Ad^)  for  Equation  37. 


We  take  the  inner  product  of  r( 

,)  and  Equation  38: 

r(Vo+n  =  r 

(«)r(j)  ~  aU)rl)AdU) 

aU)rfi)Adu)  =  ram  -  rfi)ru+ 1) 

<*(.)  rfi)rl*b 

i  =  j. 

rl)Adu)  =  ’ 

i  =  j+  1, 

(By  Equation  43.) 

l  °- 

otherwise. 

=  I 

f  *  Trk"  ... 

°(-i)  ’ 

1=7  +  1, 

(By  Equation  37.) 

1 

1  0, 

i  >  j  +  1  • 

As  if  by  magic,  most  of  the  /?,2  terms  have  disappeared!  It  is  no  longer  necessary  to  store  old  search  vectors 
to  ensure  the  /I -orthogonality  of  new  search  vectors.  This  major  advance  is  what  makes  CG  as  important 
an  algorithm  as  it  is,  because  both  the  space  complexity  and  time  complexity  per  iteration  are  reduced  from 
0(n2)  to  O(m),  where  m  is  the  number  of  nonzero  entries  of  A.  Henceforth,  we  shall  use  the  abbreviation 
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Figure  28:  The  method  of  Conjugate  Gradients. 


/?(,-)  =  i.  We  simplify  further 

T 

0ti\  =  —  (by  Equation  32) 

d(i-  t)r(i-D 
fT  r 

=  T  (>*  ^ —  (by  Equation  42). 

r(i-i)r(«-D 

Let’s  put  it  all  together  into  one  piece  now.  The  method  of  Conjugate  Gradients  is: 

d{0)  ~  r(0)  =  6  -  4z(o), 

rT  r 

a/a  =  ^  (by  Equations  32  and  42), 

%)Add) 

*(«+i)  =  *(i)  +  e*(i)d(i), 
r(<+0  =  r(«)  ~  a{i)Ad{i) ’ 

.  _  r(^-M)r(t>l) 

*(*+»)  -  -T  _  ’ 

r(.)r(0 

d{i+\)  =  r(<+i)  +  0{i+\)d(i)- 

The  performance  of  CG  on  our  sample  problem  is  demonstrated  in  Figure  28.  The  name  “Conjugate 
Gradients’’  is  a  bit  of  a  misnomer,  because  the  gradients  are  not  conjugate,  and  the  conjugate  directions  are 
not  all  gradients.  “Conjugated  Gradients”  would  be  more  accurate. 


(44) 

(45) 

(46) 

(47) 

(48) 
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Figure  29:  In  this  figure,  the  shaded  area  is  £  =  e(0)  +  span{<f(0),  <*(,)}.  The  ellipsoid  is  a  contour  on  which 
the  energy  norm  is  constant.  After  two  steps,  CG  finds  e(2),  the  point  on  S  which  minimizes  ||e||4. 

9.  Convergence  Analysis  of  Conjugate  Gradients 

CG  is  complete  after  n  iterations,  so  why  should  we  care  about  convergence  analysis?  In  practice, 
accumulated  floating  point  roundoff  error  causes  the  residual  to  gradually  lose  accuracy,  and  cancellation 
error  causes  the  search  vectors  to  lose  A -orthogonality.  The  former  problem  could  be  dealt  with  as  it  was  for 
Steepest  Descent,  but  the  latter  problem  is  not  curable.  Because  of  this  loss  of  conjugacy,  the  mathematical 
community  discarded  CG  during  the  1960s,  and  interest  only  resurged  when  evidence  for  its  effectiveness 
as  an  iterative  procedure  was  published  in  the  seventies. 

A  second  concern  is  that  for  large  enough  problems,  it  may  not  be  feasible  to  run  even  n  iterations. 

The  first  iteration  of  CG  is  identical  to  the  first  iteration  of  Steepest  Descent,  so  without  changes. 
Section  6. 1  describes  the  conditions  under  which  CG  converges  on  the  first  iteration. 

9.1.  Optimality  of  the  Error  Term 

The  first  step  of  the  convergence  analysis  is  to  show  that  CG  finds  at  every  step  the  best  solution  within 
the  bounds  of  where  it’s  been  allowed  to  explore.  Where  has  it  been  allowed  to  explore?  The  value  e(l) 
is  chosen  from  the  subspace  S  -  e(0)  +  span{d(o),d(i),.. . ,  What  do  I  mean  by  “best  solution”? 

I  mean  that  CG  chooses  the  value  from  S  that  minimizes  ||e(t)||>t  (see  Figure  29).  In  fact,  some  authors 
derive  CG  by  trying  to  minimize  within  S. 

In  the  same  way  that  the  error  term  can  be  expressed  as  a  linear  combination  of  search  directions 
(Equation  35),  its  energy  norm  can  be  expressed  as  a  summation  by  taking  advantage  of  the  A -orthogonality 
of  the  search  vectors. 

n—  t 

ll«(i)IU  =  £  tfj)dij)Adti)  (fay  Equation  35). 

j=' 

Each  term  in  this  summation  is  associated  with  a  search  direction  which  has  not  yet  been  traversed.  Any 
other  vector  e  chosen  from  £  must  have  these  same  terms  in  its  expansion,  which  proves  that  must  have 
the  minimum  energy  norm. 
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A  close  examination  of  Equations  44, 46,  and  48  reveals  that  the  residual  r(  is  chosen  from  the  subspace 
K  =  +  span{  A2r(0j, . . . ,  A‘r(o)}.  This  subspace  is  called  a  Krylov  subspace ,  a  subspace  created 

by  repeatedly  applying  a  matrix  to  a  vector.  Because  r^tj  =  -  it  follows  that  e(l)  is  chosen  from  the 
subspace  S  =  e (0)  +  span{Ae(0),  A2e(0), . . . ,  A'e(o)},  and  S  is  also  a  Krylov  subspace. 

In  other  words,  for  a  fixed  i,  the  error  term  has  the  form 


e(«) 


/  +  £>,AJ 


j=i 


e(0)- 


The  coefficients  rbj  are  related  to  the  values  «*(,•)  and  0^,  but  the  precise  relation  is  not  important  here. 
What  is  important  is  the  proof  at  the  beginning  of  this  section  that  CG  chooses  these  ipj  coefficients  to 
minimize  ||e(,)|U- 

The  term  in  parentheses  above  can  be  expressed  as  a  polynomial.  Let  P,(  A)  be  a  polynomial  of  degree 
i.  P{  can  take  either  a  scalar  or  a  matrix  as  its  argument,  and  will  evaluate  to  the  same;  for  instance,  if 
P2(  A)  =  2A2  +  1,  then  P2(  A)  =  2A2  -f  /■  This  flexible  notation  comes  in  handy  for  eigenvectors,  for  which 
Pi(A)  v  =  P,(A)v. 


Now  we  can  express  the  error  term  as 


e(i)  -  -P;(A)e(0), 

if  we  require  that  P,(0)  =  1.  CG  chooses  this  polynomial  by  choosing  the  ihj  coefficients.  Let’s  examine 
the  effect  of  applying  this  polynomial  to  e(0).  As  in  the  analysis  of  Steepest  Descent,  express  e(0)  as  a  linear 
combination  of  orthogonal  unit  eigenvectors 

n 

e(0)  =  VJ  ’ 
j= i 

and  we  find  that 

e(i)  ~  ^  vj 

i 

J 

H)\\2a  =  E#w>>]2V 

j 

CG  finds  the  polynomial  that  minimizes  this  expression,  but  convergence  is  only  as  good  as  the  convergence 
of  the  worst  eigenvector,  so 

l|e<i)lfc  <  minmax(Pt(A)]2^^2AJ 

M  A 

J 

=  min  mp[Pi(A)]2||e(0)||^.  (49) 

*»  A 


Figure  30  illustrates,  for  several  values  of  i,  the  P;  that  minimizes  this  expression  for  our  sample  problem 
with  eigenvalues  2  and  7.  There  is  only  one  polynomial  of  degree  zero  that  satisfies  Po(0)  =  1,  and  that 
is  Po(A)  =  1,  graphed  in  Figure  30(a).  The  optimal  polynomial  of  degree  one  is  P\ (A)  =  1  -  2x/9, 
graphed  in  Figure  30(b).  Note  that  P|(2)  =  5/9  and  P|(7)  =  -5/9,  and  so  the  energy  norm  of  the  error 
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PoW  (a)  P\W  (b) 


^2(A)  (c)  P2(A)  (d) 


Figure  30:  The  convergence  of  CG  after  i  iterations  depends  on  how  close  a  polynomial  P,  of  degree  i  can 
be  to  zero  on  each  eigenvalue,  given  the  constraint  that  P,  (0)  =  1. 

term  after  one  iteration  of  CG  is  no  greater  than  5/9  its  initial  value.  Figure  30(c)  shows  that,  after  two 
iterations.  Equation  49  evaluates  to  zero.  This  is  because  a  polynomial  of  degree  two  can  be  fit  to  three 
points  (P2(0)  =  l,  Pz( 2)  =  0,  and  Pz{l)  -  0).  In  general,  a  polynomial  of  degree  n  can  fit  n  +  1  points, 
and  thereby  accommodate  n  separate  eigenvalues. 

The  foregoing  discussing  reinforces  our  understanding  that  CC  yields  the  exact  result  after  n  iterations; 
and  furthermore  proves  that  CG  is  quicker  if  there  are  duplicated  eigenvalues.  Given  infinite  floating  point 
precision,  the  number  of  iterations  required  to  compute  an  exact  solution  is  at  most  the  number  of  distinct 
eigenvalues. 

We  also  find  that  CG  converges  more  quickly  when  eigenvalues  are  clustered  together  (Figure  30(d)) 
than  when  they  are  evenly  distributed  between  Amtn  and  Amox,  because  it  is  easier  for  CG  to  choose  a 
polynomial  that  makes  Equation  49  small. 

If  we  know  something  about  the  characteristics  of  the  eigenvalues  of  A,  it  is  sometimes  possible  to 
suggest  a  polynomial  which  leads  to  a  proof  of  a  fast  convergence.  For  the  remainder  of  this  analysis, 
however,  I  shall  assume  the  most  general  case:  the  eigenvalues  are  evenly  distributed  between  Amin  and 
AmaI,  the  number  of  distinct  eigenvalues  is  large,  and  floating  point  roundoff  occ  irs. 


Figure  31:  Chebyshev  polynomials  of  degree  2,  5,  10,  and  49. 


92.  Chebyshev  Polynomials 


A  useful  approach  is  to  minimize  Equation  49  over  the  range  [Amm,  Amar]  rather  than  at  a  finite  number  of 
points.  The  polynomials  that  accomplish  this  are  based  on  Chebyshev  polynomials. 

The  Chebyshev  polynomial  of  degree  i  is 

T,( u)  —  i  [(«+  \/u2  -  l)'  +  (u/  -  \/w2  -  1  )‘j  . 

(If  this  expression  doesn’t  look  like  a  polynomial  to  you,  try  working  it  out  for  i  equal  to  1  or  2.)  Several 
Chebyshev  polynomials  are  illustrated  in  Figure  31.  The  Chebyshev  polynomials  have  the  property  that 
|Ti(a^)|  <  1  (in  fact,  they  oscillate  between  1  and  -1)  on  the  domain  u;  €  [-  .  1],  and  furthermore  that 
|T,(w)|  is  maximum  on  the  domain  u>  £  [-1, 1]  among  all  such  polynomials.  In  other  words,  |T,(u;)| 
increases  as  quickly  as  possible  outside  the  boxes  in  the  illustration. 


It  is  shown  in  Appendix  C4  that  Equation  49  is  minimized  by  choosing 


W)  = 


T.  (  > 

, _ 4- A _ 2\\ 

*  \  ^moi  ^m»n  / 

Ti ! 

1 

^  ^rnax  ~  ^min  ) 

This  polynomial  has  the  oscillating  properties  of  Chebyshev  polynomials  on  the  domain  Amm  <  A  <  \max 
(see  Figure  32).  The  denominator  enforces  our  requirement  that  F,(0)  =  1 .  The  numerator  has  a  maximum 
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PzW 


Figure  32:  The  polynomial  P2(\)  that  minimizes  Equation  49  for  Xm,n  =  2  and  A mox  =  7  in  the  general  case. 
This  curve  is  a  scaled  version  of  the  Chebyshev  polynomial  of  degree  2.  The  energy  norm  of  the  error  term 
after  two  iterations  is  no  greater  than  0.183  times  its  initial  value.  Compare  with  Figure  30(c),  where  it  is 
known  that  there  are  only  two  eigenvalues. 

value  of  one  on  the  interval  between  Am,„  and  Amai,  so  from  Equation  49  we  have 

lle(«)IU  ^ 


The  second  addend  inside  the  square  brackets  converges  to  zero  as  i  grows,  so  it  is  more  common  to  express 
the  convergence  of  CG  with  the  weaker  inequality 


Ti 


T, 


-^ma  x 

^moi  ''min 
K+  1N  -* 
n  -  1 


"h  ^min  A  H  || 

nr- J 

/'mtn  / 

lle(0)IU 


'y/z+iy  +  (^-iy 


s/k  -  l 


\/k  ■+•  I 


-I 


'(0)iU- 


lle(«)IU  ^  2  (^rpr)  lle(o)IU-  (50) 

Figure  33  charts  the  convergence  per  iteration  of  CG,  ignoring  the  lost  factor  of  2  (which  will  generally 
be  amortized  over  many  iterations).  Of  course,  CG  often  converges  faster  than  Equation  50  would  suggest, 
because  of  good  eigenvalue  distributions  or  good  starting  points.  Comparing  Equations  50  and  28,  it  is  clear 
that  the  convergence  of  CG  is  much  quicker  than  that  of  Steepest  Descent  (see  Figure  34).  However,  it  is 
not  necessarily  the  case  that  every  iteration  of  CG  enjoys  faster  convergence;  for  example,  the  first  iteration 
of  CG  is  always  identical  to  the  first  iteration  of  Steepest  Descent.  The  factor  of  2  in  Equation  50  allows 
CG  a  little  slack  for  these  poor  iterations. 


Figure  34:  Number  of  iterations  of 
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The  dominating  operations  during  an  iteration  of  either  Steepest  Descent  or  CG  are  matrix-vector  products. 
In  general,  matrix- vector  multiplication  requires  0(m)  operations,  where  m  is  the  number  of  non-zero 
entries  in  the  matrix.  For  many  problems,  including  those  listed  in  the  introduction,  .4  is  sparse  and 
m  6  0(n). 


Suppose  we  wish  to  perform  enough  iterations  to  reduce  the  norm  of  the  error  by  a  factor  of  5;  that  is, 
||e(l)||  <  e ||e«))||.  Equation  28  can  be  used  to  show  that  the  maximum  number  of  iterations  required  to 
achieve  this  bound  using  Steepest  Descent  is 


i  < 


whereas  Equation  50  suggests  that  the  maximum  number  of  iterations  CG  requires  is 


i  < 


I  conclude  that  Steepest  Descent  has  a  time  complexity  of  O(ttik),  whereas  CG  has  a  time  complexity  of 
0{  my/H).  Both  algorithms  have  a  space  complexity  of  0{m). 


For  finite  difference  and  finite  element  approximations  of  second-order  elliptic  boundary  value  problems 
posed  on  d-dimensional  domains,  k  6  0(  n2/d).  Thus,  Steepest  Descent  has  a  time  complexity  of  0(  n 1 )  for 
two-dimensional  problems,  versus  0(n 3^2)  forCG;  and  Steepest  Descent  has  a  time  complexity  of  0(  n5/i) 
for  three-dimensional  problems,  versus  0(  n4/3)  for  CG. 


11.  Starting  and  Stopping 

In  the  preceding  presentation  of  the  Steepest  Descent  and  Conjugate  Gradient  algorithms,  several  details 
have  been  omitted;  particularly,  how  to  choose  a  starting  point,  and  when  to  stop. 


11.1.  Starting 

There’s  not  much  to  say  about  starting.  If  you  have  a  rough  estimate  of  the  value  of  x,  use  it  as  the 
starting  value  x(0).  If  not,  set  x(0)  =  0;  either  algorithm  will  eventually  converge  when  used  to  solve  linear 
systems.  Nonlinear  systems  (coming  up  in  Section  14)  are  trickier,  though,  because  there  may  be  several 
local  minima,  and  the  choice  of  starting  point  will  determine  which  minimum  the  procedure  converges  to, 
or  whether  it  will  converge  at  all. 


11.2.  Stopping 

When  Steepest  Descent  or  CG  reaches  the  minimum  point,  the  residual  becomes  zero,  and  if  Equation  1 1 
or  47  is  evaluated  an  iteration  later,  a  division  by  zero  will  result.  It  seems,  then,  that  we  must  stop 
immediately  when  the  residual  is  zero.  To  complicate  things,  though,  accumulated  roundoff  error  in  the 
recursive  formulation  of  the  residual  (Equation  46)  may  yield  a  false  zero  residual;  this  problem  could  be 
resolved  by  restarting  with  Equation  44. 


_ Preconditioning _ 37 

Usually,  however,  one  wishes  to  stop  before  convergence  is  complete.  Because  the  error  term  is  not 
available,  it  is  customary  to  stop  when  the  norm  of  the  residual  falls  below  a  specified  value;  often,  this 
value  is  some  small  fraction  of  the  initial  residual  (||r(,)||  <  e||r(0)||).  See  Appendix  B  for  sample  code. 

12.  Preconditioning 

Preconditioning  is  a  technique  for  improving  the  condition  number  of  a  matrix.  Suppose  that  M  is  a 
symmetric,  positive-definite  matrix  that  approximates  A,  but  is  easier  to  invert.  We  can  solve  Ax  =  b 
indirectly  by  solving 

M~lAx  =  M~lb.  (51 ) 

If  k(M~‘ A)  <C  k(  A),  or  if  the  eigenvalues  of  M~l  A  are  better  clustered  than  those  of  A,  we  can  iteratively 
solve  Equation  51  more  quickly  than  the  original  problem.  The  catch  is  that  M~lA  is  not  generally 
symmetric  nor  definite,  even  if  M  and  A  are. 

We  can  circumvent  this  difficulty,  because  for  every  symmetric,  positive-definite  M  there  is  a  (not 
necessarily  unique)  matrix  E  that  has  the  property  that  EET  =  M.  (Such  an  E  can  be  obtained,  for 
instance,  by  Cholesky  factorization.)  The  matrices  M~ 1 A  and  E~l  AE~T  have  the  same  eigenvalues  (and 
therefore  the  same  condition  number).  This  is  true  because  if  v  is  an  eigenvector  of  M~ 1 A  with  eigenvalue 
A,  then  ETv  is  an  eigenvector  of  E~{AE~T  with  eigenvalue  A: 

(E~l  AE~t)(Etv)  =  ( ETE-T)E~lAv  =  EtM~'Av  =  A  ETv. 

The  system  Ax  =  b  can  be  transformed  into  the  problem 

E-'AE~tx  =  E~'b,  x  =  Etx, 

which  we  solve  first  for  x,  then  for  x.  Because  E~lAE~T  is  symmetric  and  positive-definite,  x  can  be 
found  by  Steepest  Descent  or  CG.  The  process  of  using  CG  to  solve  this  system  is  called  the  Transformed 
Preconditioned  Conjugate  Gradient  Method: 

d(o)  =  r(o)  =  E~lb-  E~l  AE~TX(0), 

_ 

Q(,)  d* E-'AE-Tdt (,)’ 

%+l)  =  %)  +  “(!')%)’ 

r(t+1)  =  ?(i)  -  «(,)£-' A£-rd(i), 

~T  —  7 

r(i)r(*) 

d(i+i)  =  f(i+ ,,  +  /?(,+!  )d(i). 

This  method  has  the  undesirable  characteristic  that  E  must  be  computed.  However,  a  few  careful 
variable  substitutions  can  eliminate  E.  Setting  r(,)  =  E~lr^  and  =  ETd^,  and  using  the  identities 
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Figure  35:  Contour  lines  of  the  quadratic  form  of  the  diagonally  preconditioned  sample  problem. 

X(,j  =  ETX(,)  and  E~T E~[  =  M~l,  we  derive  the  Untransformed  Preconditioned  Conjugate  Gradient 
Method: 


r(0)  =  b  ~  Ax(0), 

d(oj  =  M-1r(0), 


rl)M  'r(0 
“l0‘  4)^(0  ' 

I(.+1)  =  *(.)  +  Q(i)d(i), 
r(i+ 1)  =  r(.)  ~  <*(i)Ad( 


/%+D 


rli)M~lr(i) 


d(i+ 1)  =  M  1  r(»+l)  + 


The  matrix  E  does  not  appear  in  these  equations;  only  M  ~ 1  is  needed.  By  the  same  means,  it  is  possible 
to  derive  a  Preconditioned  Steepest  Descent  Method  which  does  not  use  E. 

The  effectiveness  of  a  preconditioner  M  is  determined  by  the  condition  number  of  M _l  A,  and  occa¬ 
sionally  by  its  clustering  of  eigenvalues.  The  problem  remains  of  finding  a  preconditioner  that  approximates 
A  well  enough  to  improve  convergence  enough  to  make  up  for  the  cost  of  computing  the  product  M~ 1 
once  per  iteration.  (Note  that  it  is  not  necessary  to  explicitly  compute  M  or  M~ 1 ;  it  is  only  necessary  to  be 
able  to  compute  the  effect  of  applying  M~l  to  a  vector.)  Within  this  constraint,  there  is  a  surprisingly  rich 
supply  of  possibilities,  and  I  can  only  scratch  the  surface  here. 

Intuitively,  preconditioning  is  an  attempt  to  stretch  the  quadratic  form  to  make  it  appear  more  spherical, 
so  that  the  eigenvalues  are  close  to  each  other.  A  perfect  preconditioner  is  M  =  A:  for  this  preconditioner. 
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M~ 1 A  has  a  condition  number  of  one,  and  the  quadratic  form  is  perfectly  spherical,  so  solution  takes  only 
one  iteration.  Unfortunately,  the  preconditioning  step  is  solving  the  system  Mx  -  b,  so  this  isn’t  a  useful 
preconditioner  at  all. 

The  simplest  preconditioner  is  a  diagonal  matrix  whose  diagonal  entries  are  identical  to  those  of  A.  The 
process  of  applying  this  preconditioner,  known  as  diagonal  preconditioning  or  Jacobi  preconditioning,  is 
equivalent  to  scaling  the  quadratic  form  along  the  coordinate  axes.  (By  comparison,  the  perfect  precondi¬ 
tioner  M  =  A  scales  the  quadratic  form  along  its  eigenvector  axes.)  A  diagonal  matrix  is  trivial  to  invert, 
but  is  often  only  a  mediocre  preconditioner.  The  contour  lines  of  our  sample  problem  are  shown,  after 
diagonal  preconditioning,  in  Figure  35.  Comparing  with  Figure  3,  it  is  clear  that  some  improvement  has 
occurred.  The  condition  number  has  improved  from  3.5  to  roughly  2.8.  Of  course,  this  improvement  is 
much  mote  beneficial  for  systems  where  n  »  2. 

A  more  elaborate  preconditioner  is  incomplete  Cholesky  preconditioning.  Cholesky  factorization  is  a 
technique  for  factoring  a  matrix  A  into  the  form  LLT ,  where  L  is  a  lower  triangular  matrix.  Incomplete 
Cholesky  factorization  is  a  variant  in  which  little  or  no  fill  is  allowed;  A  is  approximated  by  the  product 
L  Lt,  where  L  might  be  restricted  to  have  the  same  pattern  of  nonzero  elements  as  A ;  other  elements  of  L  are 
thrown  away.  To  use  LLT  as  a  preconditioner,  the  solution  to  LLTw  =  2  is  computed  by  backsubstitution 
(the  inverse  of  LLT  is  never  explicitly  computed).  Unfortunately,  incomplete  Cholesky  preconditioning  is 
not  always  stable. 

Many  preconditioners,  some  quite  sophisticated,  have  been  developed.  Whatever  your  choice,  it  is 
generally  accepted  that  for  large-scale  applications,  CG  should  nearly  always  be  used  with  a  preconditioner 
if  possible. 


13.  Conjugate  Gradients  on  the  Normal  Equations 


CG  can  be  used  to  solve  systems  where  A  is  not  symmetric,  not  positive-definite,  and  even  not  square.  A 
solution  to  the  least  squares  problem 

tmn||Ax  -  6)S2  (52) 

can  be  found  by  setting  the  derivative  of  Expression  52  to  zero: 

AT  Ax  =  ATb.  (53) 

If  A  is  square  and  nonsingular,  the  solution  to  Equation  53  is  the  solution  to  Ax  =  b.  If  A  is  not  square, 
and  Ax  —  b  is  overconstrained  —  that  is,  has  more  linearly  independent  equations  than  variables  —  then 
there  may  or  may  not  be  a  solution  to  Ax  =  6,  but  it  is  always  possible  to  find  a  value  of  x  that  minimizes 
Expression  52,  the  sum  of  the  squares  of  the  errors  of  each  linear  equation. 

AT  A  is  symmetric  and  positive  (for  any  x,  xT  AT  Ax  =  ||/lz||2  >  0).  \i  Ax  =  b  is  not  underconstrained, 
then  AT  A  is  nonsingular,  and  methods  like  Steepest  Descent  and  CG  can  be  used  to  solve  Equation  53.  The 
only  nuisance  in  doing  so  is  that  the  condition  number  of  AT  A  is  the  square  of  that  of  A,  so  convergence  is 
significantly  slower. 

An  important  technical  point  is  that  the  matrix  AT  A  is  never  formed  explicitly,  because  it  is  less  sparse 
than  A.  Instead,  AT  A  is  multiplied  by  d  by  first  finding  the  product  Ad,  and  then  Ar  Ad.  Also,  numerical 
stability  is  improved  if  the  value  dTAT  Ad  (in  Equation  45)  is  computed  by  taking  the  inner  product  of  Ad 
with  itself. 
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14.  The  Nonlinear  Conjugate  Gradient  Method 


CG  can  be  used  not  only  to  find  the  minimum  point  of  a  quadratic  form,  but  to  minimize  any  continuous 
function  f(x)  for  which  the  gradient  /'  can  be  computed.  Applications  include  a  variety  of  optimization 
problems,  such  as  engineering  design,  neural  net  training,  and  nonlinear  regression. 


14.1.  Outline  of  the  Nonlinear  Conjugate  Gradient  Method 


To  derive  nonlinear  CG,  there  are  three  changes  to  the  linear  algorithm:  the  recursive  formula  for  the 
residual  cannot  be  used,  it  becomes  more  complicated  to  compute  the  step  size  a,  and  there  are  several 
different  choices  for  /?. 

In  nonlinear  CG,  the  residual  is  always  set  to  the  negation  of  the  gradient;  =  - /'(*(,)  )■  The  search 
directions  are  computed  by  Gram-Schmidt  conjugation  of  the  residuals  as  with  linear  CG.  Performing  a  line 
search  along  this  search  direction  is  much  more  difficult  than  in  the  linear  case,  and  a  variety  of  procedures 
can  be  used.  As  with  the  linear  CG,  a  value  of  that  minimizes  /(*(,)  +  a^d^)  is  found  by  ensuring 
that  the  gradient  is  orthogonal  to  the  search  direction.  We  can  use  any  algorithm  that  finds  the  zeros  of  the 
expression  [/'(x(l)  +  a(i)d(i) )]Td(i). 

In  linear  CG,  there  are  several  equivalent  expressions  for  the  value  of  l).  In  nonlinear  CG,  these  different 
expressions  are  no  longer  equivalent;  researchers  are  still  investigating  the  best  choice.  Two  choices  are  the 
Fletcher-Reeves  formula,  which  we  used  in  linear  CG  for  its  ease  of  computation,  and  the  Polak-Ribi&re 
formula: 

-  'r(«+»)r(,+  l>  ,iPR  _ 

0  -  rrr  ’  %M)  - 

r(0r<«> 


FR 


rf,+  l )(r(l+,)  -  r(t)) 


tT:\T(;\ 


The  Fletcher-Reeves  method  converges  if  the  starting  point  is  sufficiently  close  to  the  desired  minimum, 
whereas  the  Polak-Ribi^re  method  can,  in  rare  cases,  cycle  infinitely  without  converging.  However,  Polak- 
Ribifcre  often  converges  much  more  quickly. 

Fortunately,  convergence  of  the  Polak-Ribiere  method  can  be  guaranteed  by  choosing  ;3  =  max{/jp/?.  0}. 
Using  this  value  is  equivalent  to  restarting  CG  if  dPR  <  0.  To  restart  CG  is  to  forget  the  past  search 
directions,  and  start  CG  anew  in  the  direction  of  steepest  descent. 

Here  is  an  outline  of  the  nonlinear  CG  method: 


^(0)  -  r(0)  =  -f\x( 0))> 


P(i+ 1) 


Find  Q(jj  that  minimizes  /(x(,)  4- 


rfi+\)r(i+l) 

r5)r(«) 


®  (»+!)  =  ^(i)  +  <*(«■)<*(»')’ 

r(.+  l)  =  -f'(x(i+ 1))> 


d(i+i)  =  ^(i+1)  +  /?(,+i)^(»)- 
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Nonlinear  CG  comes  with  few  of  the  convergence  guarantees  of  linear  CG.  The  less  similar  /  is  to  a 
quadratic  function,  the  more  quickly  the  search  directions  lose  conjugacy.  (It  will  become  clear  shortly  that 
the  term  “conjugacy”  still  has  some  meaning  in  nonlinear  CG.)  Another  problem  is  that  a  general  function 
/  may  have  many  local  minima.  CG  is  not  guaranteed  to  converge  to  the  global  minimum,  and  may  not 
even  find  a  local  minimum  if  /  has  no  lower  bound. 

Figure  36  illustrates  nonlinear  CG.  Figure  36(a)  is  a  function  with  many  local  minima.  Figure  36(b) 
demonstrates  the  convergence  of  nonlinear  CG  with  the  Fletcher-Reeves  formula.  In  this  example,  CG  is 
not  nearly  as  effective  as  in  the  linear  case;  this  function  is  deceptively  difficult  to  minimize.  Figure  36(c) 
shows  a  cross-section  of  the  surface,  corresponding  to  the  first  line  search  in  Figure  36(b).  Notice  that  there 
are  several  minima;  the  line  search  finds  a  value  of  a  corresponding  to  a  nearby  minimum.  Figure  36(d) 
shows  the  superior  convergence  of  Polak-Ribiere  CG. 

Because  CG  can  only  generate  n  conjugate  vectors  in  an  n -dimensional  space,  it  makes  sense  to  restart 
CG  every  n  iterations,  especially  if  n  is  small.  Figure  37  shows  the  effect  when  nonlinear  CG  is  restarted 
every  second  iteration.  (For  this  particular  example,  both  the  Fletcher-Reeves  method  and  the  Polak-Ribiere 
method  appear  the  same.) 


14.2.  General  Line  Search 


Depending  on  the  value  of  /',  it  might  be  possible  to  use  a  fast  algorithm  to  find  the  zeros  of  f'Td.  For 
instance,  if  /'  is  polynomial  in  a,  then  an  efficient  algorithm  for  polynomial  zero-finding  can  be  used. 
However,  we  will  only  consider  general-purpose  algorithms. 

Two  iterative  methods  for  zero-finding  are  the  Newton-Raphson  method  and  the  Secant  method.  Both 
methods  require  that  /  be  twice  continuously  differentiable.  Newton-Raphson  also  requires  that  it  be 
possible  to  compute  the  second  derivative  of  fix  +  ad)  with  respect  to  a. 


The  Newton-Raphson  method  relies  on  the  Taylor  series  approximation 
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where  fix)  is  the  Hessian  matrix 
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(55) 


The  function  fix  +  ad)  is  approximately  minimized  by  setting  Expression  55  to  zero,  giving 
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Figure  38:  The  Newton-Raphson  method  for  minimizing  a  one-dimensional  function  (solid  curve).  Starting 
from  the  point  x,  calculate  the  first  and  second  derivatives,  and  use  them  to  construct  a  quadratic  approxi¬ 
mation  to  the  function  (dashed  cun/e).  A  new  point  z  is  chosen  at  the  base  of  the  parabola.  This  procedure 
is  iterated  until  convergence  is  reached. 
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The  truncated  Taylor  series  approximates  f(x  +  ad)  with  a  parabola;  we  step  to  the  bottom  of  the  parabola 
(see  Figure  38).  In  fact,  if  /  is  a  quadratic  form,  then  this  parabolic  approximation  is  exact,  because  f"  is 
just  the  familiar  matrix  A.  In  general,  the  search  directions  are  conjugate  if  they  are  /"-orthogonal.  The 
meaning  of  “conjugate”  keeps  changing,  though,  because  f"  varies  with  x.  The  more  quickly  f"  varies 
with  x,  the  more  quickly  the  search  directions  lose  conjugacy.  On  the  other  hand,  the  closer  x(l)  is  to  the 
solution,  the  less  f"  varies  from  iteration  to  iteration.  The  closer  the  starting  point  is  to  the  solution,  the 
more  similar  the  convergence  of  nonlinear  CG  is  to  that  of  linear  CG. 

To  perform  an  exact  line  search  of  a  non-quadratic  function,  repeated  steps  must  be  taken  along  the  line 
until  f^d  is  zero;  hence,  one  CG  iteration  may  include  many  Newton-Raphson  iterations.  The  values  of 
f'Td  and  dTf"d  must  be  evaluated  at  each  step.  These  evaluations  may  be  inexpensive  if  dT  f"d  can  be 
analytically  simplified,  but  if  the  full  matrix  f"  must  be  evaluated  repeatedly,  the  algorithm  is  prohibitively 
slow.  For  some  applications,  it  is  possible  to  circumvent  this  problem  by  performing  an  approximate  line 
search  that  uses  only  the  diagonal  elements  of  f" .  Of  course,  there  are  functions  for  which  it  is  not  possible 
to  evaluate  f"  at  all. 

To  perform  an  exact  line  search  without  computing  f",  the  Secant  method  approximates  the  second 
derivative  of  f(x  4-  ad)  by  evaluating  the  first  derivative  at  the  distinct  points  a  =  0  and  a  =  cr,  where  a 
is  an  arbitrary  small  nonzero  number 


d 2  „  ,  „  _  [£f{x  +  ad)]a=a  -  [£/(x  +  ad)]o=0 

—f(x  +  ad)  ~  ; 

{J'{x  +  <rd)]Td-[f'[x))Td 


<7^0 


(56) 


which  becomes  a  better  approximation  to  the  second  derivative  as  a  and  o  approach  zero.  If  we  substitute 
Expression  56  for  the  third  term  of  the  Taylor  expansion  (Equation  54),  we  have 


j^/(x  +  ad)  «  [f'(x)]Td  +  ^  \[f'(x  +  (rd)]Td~  [f'(x)\Td}  . 


Minimize  f(x  +  ad)  by  setting  its  derivative  to  zero: 


_ {f{x)\Td _ 

a  [/'(*  +  <rd)]Td  -  [f'{x)\Td 


(57) 


Like  Newton-Raphson,  the  Secant  method  also  approximates  f(x  +  ad)  with  a  parabola,  but  instead 
of  choosing  the  parabola  by  finding  the  first  and  second  derivative  at  a  point,  it  finds  the  first  derivative  at 
two  different  points  (see  Figure  39).  Typically,  we  will  choose  an  arbitrary  a  on  the  first  Secant  method 
iteration;  on  subsequent  iterations  we  will  choose  x  4-  crd  to  be  the  value  of  x  from  the  previous  Secant 
method  iteration.  In  other  words,  if  we  let  denote  the  value  of  a  calculated  during  Secant  iteration  i, 
then<7[,+  l]  =  -«[»]• 

Both  the  Newton-Raphson  and  Secant  methods  should  be  terminated  when  x  is  reasonably  close  to  the 
exact  solution.  Demanding  too  little  precision  could  cause  a  failure  of  convergence,  but  demanding  too 
fine  precision  makes  the  computation  unnecessarily  slow  and  gains  nothing,  because  conjugacy  will  break 
down  quickly  anyway  if  f"(x)  varies  much  with  x.  Therefore,  a  quick  but  inexact  line  search  is  often 
the  better  policy  (for  instance,  use  only  a  fixed  number  of  Newton-Raphson  or  Secant  method  iterations). 
Unfortunately,  inexact  line  search  may  lead  to  the  construction  of  a  search  direction  that  is  not  a  descent 
direction.  A  common  solution  is  to  test  for  this  eventuality  (is  rTd  nonpositive?),  and  restart  CG  if  necessary 
by  setting  d  =  r. 
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Figure  39:  The  Secant  method  for  minimizing  a  one-dimensional  function  (solid  curve).  Starting  from  the 
point  x,  calculate  the  first  derivatives  at  two  different  points  (here,  a  =  0  and  a  =  2),  and  use  them  to 
construct  a  quadratic  approximation  to  the  function  (dashed  curve).  Notice  that  both  curves  have  the  same 
slope  at  a  =  0,  and  also  at  a  =  2.  As  in  the  Newton-Raphson  method,  a  new  point  ;  is  chosen  at  the  base 
of  the  parabola,  and  the  procedure  is  iterated  until  convergence. 

A  bigger  problem  with  both  methods  is  that  they  cannot  distinguish  minima  from  maxima.  The  result 
of  nonlinear  CG  generally  depends  strongly  on  the  starting  point,  and  if  CG  with  the  Newton-Raphson  or 
Secant  method  starts  near  a  local  maximum,  it  is  likely  to  converge  to  that  point. 

Each  method  has  its  own  advantages.  The  Newton-Raphson  method  has  a  better  convergence  rate,  and 
is  to  be  preferred  if  dT  f"d  can  be  calculated  (or  well  approximated)  quickly  (i.e.,  in  0{n )  time).  The  Secant 
method  only  requires  first  derivatives  of  /,  but  its  success  may  depend  on  a  good  choice  of  the  parameter  a. 
It  is  easy  to  derive  a  variety  of  other  methods  as  well.  For  instance,  by  sampling  /  at  three  different  points, 
it  is  possible  to  generate  a  parabola  that  approximates  f(x  +  ad)  without  the  need  to  calculate  even  a  first 
derivative  of  f. 

143.  Preconditioning 

Nonlinear  CG  can  be  preconditioned  by  choosing  a  preconditioner  M  that  approximates  f"  and  has  the 
property  that  M~ 1  r  is  easy  to  compute.  In  linear  CG,  the  preconditioner  attempts  to  transform  the  quadratic 
form  so  that  it  is  similar  to  a  sphere;  a  nonlinear  CG  preconditioner  performs  this  transformation  for  a  region 
nearx(j). 

Even  when  it  is  too  expensive  to  compute  the  full  Hessian  /",  it  is  often  quite  reasonable  to  compute 
its  diagonal  for  use  as  a  preconditioner.  However,  be  forewarned  that  if  x  is  sufficiently  far  from  a  local 
minimum,  the  diagonal  elements  of  the  Hessian  may  not  all  be  positive.  A  preconditioner  should  be  positive- 
definite,  so  nonpositive  diagonal  elements  cannot  be  allowed.  A  conservative  solution  is  to  not  precondition 
(set  M  =  I)  when  the  Hessian  cannot  be  guaranteed  to  be  positive-definite.  Figure  40  demonstrates 
the  convergence  of  diagonally  preconditioned  nonlinear  CG,  with  the  Polak-Ribi&re  formula,  on  the  same 
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Figure  40:  The  preconditioned  nonlinear  Conjugate  Gradient  Method,  using  the  Polak-Rib&re  formula  and 
a  diagonal  preconditioner.  The  space  has  been  “stretched”  to  show  the  improvement  in  circularity  of  the 
contour  lines  around  the  minimum. 

function  illustrated  in  Figure  36.  Here,  I  have  cheated  by  using  the  diagonal  of  f"  at  the  solution  point  x  to 
precondition  every  iteration. 


A  Notes 


Conjugate  Direction  methods  were  probably  first  presented  by  Schmidt  [14]  in  1908,  and  were  independently 
reinvented  by  Fox,  Huskey,  and  Wilkinson  [7]  in  1 948.  In  the  early  fifties,  the  method  of  Conjugate  Gradients 
was  discovered  independently  by  Hestenes  [10]  and  Stiefel  [15];  shortly  thereafter,  they  jointly  published 
what  is  considered  the  seminal  reference  on  CG  [11].  Convergence  bounds  for  CG  in  terms  of  Chebyshev 
polynomials  were  developed  by  Kaniei  [12].  A  more  thorough  analysis  of  CG  convergence  is  provided  by 
van  der  Sluis  and  van  der  Vorst  [16].  CG  was  popularized  as  an  iterative  method  for  large,  sparse  matrices 
by  Reid  [13]  in  1971. 

CG  was  generalized  to  nonlinear  problems  in  1964  by  Fletcher  and  Reeves  [6],  based  on  work  by 
Davidon  [4]  and  Fletcher  and  Powell  [5].  Convergence  of  nonlinear  CG  wuh  inexact  line  searches  was 
analyzed  by  Daniel  [3].  The  choice  of  i3  for  nonlinear  CG  is  discussed  by  Gilbert  and  Nocedal  [8]. . 

A  history  and  extensive  annotated  bibliography  of  CG  to  the  mid-seventies  is  provided  by  Golub  and 
O’Leary  [9],  Most  research  since  that  time  has  focused  on  nonsymmetric  systems.  A  survey  of  iterative 
methods  for  the  solution  of  linear  systems  is  offered  by  Barrett  et  al.  [  1  ]. 
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B  Canned  Algorithms 

The  code  given  in  this  section  represents  efficient  implementations  of  the  algorithms  discussed  in  this  article. 


Bl.  Steepest  Descent 

Given  the  inputs  A,  6,  a  starting  value  x ,  a  maximum  number  of  iterations  imar,  and  an  error  tolerance 

e  <  1: 


This  algorithm  terminates  when  the  maximum  number  of  iterations  imax  has  been  exceeded,  or  when 

llr(«)H  ^  "llr(0)ll- 

The  fast  recursive  formula  for  the  residual  is  usually  used,  but  once  every  50  iterations,  the  exact  residual 
is  recalculated  to  remove  accumulated  floating  point  error.  Of  course,  the  number  50  is  arbitrary;  for  large 
n,  s/n  might  be  appropriate.  If  the  tolerance  is  large,  the  residual  need  not  be  corrected  at  all.  If  the 
tolerance  is  close  to  the  limits  of  the  floating  point  precision  of  the  machine,  a  test  should  be  added  after  6 
is  evaluated  to  check  if  <5  <  s26q,  and  if  this  test  holds  true,  the  exact  residual  should  also  be  recomputed 
and  6  reevaluated.  This  prevents  the  procedure  from  terminating  too  quickly  due  to  floating  point  roundoff 
error. 
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Given  the  inputs  A,  b,  a  starting  value  t,  a  maximum  number  of  iterations  imax,  and  an  error  tolerance 

s  <  l: 

i  <=  0 

r  <=  b  —  Ax 
d  ^  r 

fine w  ^  T 

fio  ^  fine  w 

While  i  <  imax  and  6new  >  s2SQ  do 
q  <*=  Ad 

x  <=  x  +  ad 
If  i  is  divisible  by  50 
r  <=  b  —  Ax 
else 

r  <=  r  -  aq 
fiold  ^  finew 

fine w  4=  rTr 

d  •$=  r  +  j3d 

i  <S=  i  +  1 


See  the  comments  at  the  end  of  Section  Bl. 
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B3.  Preconditioned  Conjugate  Gradients 

Given  the  inputs  A,  b,  a  starting  value  x,  a  (perhaps  implicitly  defined)  preconditioner  M,  a  maximum 
number  of  iterations  imax,  and  an  error  tolerance  e  <  1: 


The  statement  “s  •<=  M  1  r”  implies  that  one  should  apply  the  preconditioner,  which  may  not  actually 
be  in  the  form  of  a  matrix. 


See  also  the  comments  at  the  end  of  Section  B 1 . 
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B4.  Nonlinear  Conjugate  Gradients  with  Newton-Raphson  and  Fletcher-Reeves 


Given  a  function  /,  a  starting  value  x,  a  maximum  number  of  CG  iterations  imai,  a  CG  error  tolerance 
s  <  1,  a  maximum  number  of  Newton-Raphson  iterations  jmax,  and  a  Newton-Raphson  error  tolerance 

€  <  1: 


This  algorithm  terminates  when  the  maximum  number  of  iterations  imax  has  been  exceeded,  or  when 

Iholi  <  £iir(<>)ii- 

Each  Newton-Raphson  iteration  adds  ad  to  x;  the  iterations  are  terminated  when  each  update  ad  falls 
below  a  given  tolerance  (||ad||  <  e),  or  when  the  number  of  iterations  exceeds  jmax-  A  fast  inexact  line 
search  can  be  accomplished  by  using  a  small  jmax  and/or  by  approximating  the  Hessian  f"(x)  with  its 
diagonal. 

Nonlinear  CG  is  restarted  (by  setting  d  <=  r)  whenever  a  search  direction  is  computed  that  is  not  a 
descent  direction.  It  is  also  restarted  once  every  n  iterations,  to  improve  convergence  for  small  n. 

The  calculation  of  a  may  result  in  a  divide-by-zero  error.  This  may  occur  because  the  starting  point  X(6) 
is  not  sufficiently  close  to  the  desired  minimum,  or  because  /  is-  not  twice  continuously  differentiable.  In 
the  former  case,  the  solution  is  to  choose  a  better  starting  point  or  a  more  sophisticated  line  search.  In  the 
latter  case,  CG  might  not  be  the  most  appropriate  minimization  algorithm. 
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B5.  Preconditioned  Nonlinear  Conjugate  Gradients  with  Secant  and  Polak-Ribifere 

Given  a  function  /,  a  starting  value  x,  a  maximum  number  of  CG  iterations  imax,  a  CG  error  tolerance 
s  <  l,  a  Secant  method  step  parameter  <jq,  a  maximum  number  of  Secant  method  iterations  jmax>  and  a 
Secant  method  error  tolerance  e  <  1 : 


This  algorithm  terminates  when  the  maximum  number  of  iterations  imax  has  been  exceeded,  or  when 

Ikfoll  <  £lho)ll- 

Each  Secant  method  iteration  adds  ad  to  x;  the  iterations  are  terminated  when  each  update  ad  falls  below 
a  given  tolerance  (||ad||  <  e),  or  when  the  number  of  iterations  exceeds  jmax.  A  fast  inexact  line  search  can 
be  accomplished  by  using  a  small  jmax.  The  parameter  <70  determines  the  value  of  a  in  Equation  57  for  the 
first  step  of  each  Secant  method  minimization.  Unfortunately,  it  may  be  necessary  to  adjust  this  parameter 
to  achieve  convergence. 
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r(.+  l)Af-1(r(.  +  l)-r(,)) 


.  Care  must 


The  Polak-Ribifcre  j3  parameter  is  Sn%5dmut  =  r(,)',  Care  must 

be  taken  that  the  preconditioner  M  is  always  positive-definite.  The  preconditioner  is  not  necessarily  in  the 
form  of  a  matrix. 


Nonlinear  CG  is  restarted  (by  setting  d  <i=  r)  whenever  the  Polak-Ribifere  /3  parameter  is  negative.  It  is 
also  restarted  once  every  n  iterations,  to  improve  convergence  for  small  n. 

Nonlinear  CG  presents  several  choices:  Preconditioned  or  not,  Newton-Raphson  method  or  Secant 
method  or  another  method,  Fletcher-Reeves  or  Polak-Ribifcre.  It  should  be  possible  to  construct  any  of  the 
variations  from  the  versions  presented  above.  (Note  that  Polak-Ribi£re  is  almost  always  to  be  preferred.) 


C  Ugly  Proofs 

Cl.  The  Solution  to  Ax  =  b  Minimizes  the  Quadratic  Form 

Suppose  A  is  symmetric.  Let  x  be  a  point  that  satisfies  Ax  =  b  and  minimizes  the  quadratic  form 
(Equation  3),  and  let  e  be  an  error  term.  Then 

f(x  +  «)  =  ^(x  +  e)T A(x  +  e)  -  bT(x  +  e)  -I-  c  (by  Equation  3) 

=  ^xtAx  +  eT Ax  +  )-eT  Ae  -  bTx  -  bT e  +  c  (by  symmetry  of  A) 

it  i* 

—  \xtAx  -  bT x  +  c  +  eTb  +  ]-eT Ae  -  bTe 
=  /(*)  +  \eTAe- 

If  A  is  positive-definite,  then  the  latter  term  is  positive  for  all  e  ^  0;  therefore  x  minimizes  /. 


C2.  A  Symmetric  Matrix  Has  n  Orthogonal  Eigenvectors. 

Suppose  A  is  nonsingular,  so  that  there  is  a  set  of  n  linearly  independent  eigenvectors  iq,  V2, . . . ,  vn.  Suppose 
furthermore  that  A  is  symmetric.  For  any  two  eigenvectors,  we  have 

vf  vj  =  vj  A~ 1  Avj 

=  (A-1u,)rAvj  (by  symmetry  of  A) 

Xj  T 

=  ~vi  VJ- 

This  equation  can  hold  only  if  vf  vj  =  0  or  A,  =  Xj.  In  the  former  case,  vt  and  v3  are  orthogonal.  In 
the  latter  case,  both  r,  and  v}  have  the  same  eigenvalue  A,  so  any  linear  combination  of  u,  and  vj  is  also 
an  eigenvector  with  eigenvalue  A.  Replace  vj  with  a  vector  that  is  a  linear  combination  of  v,  and  Vj,  and 
is  orthogonal  to  u,.  (In  other  words,  apply  Gram-Schmidt  orthogonalization  to  v,  and  Vj,  and  any  other 
vectors  with  the  same  eigenvalue.) 

By  applying  this  reasoning  to  every  pair  of  eigenvectors,  one  can  generate  an  orthogonal  set  of  eigen¬ 
vectors. 


C3.  Convergence  of  Steepest  Descent 


Ugly  Proofs 
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V 


Consider  the  expression  u>  from  Equation  25: 


u2  =  l- 


Several  simplifications  can  be  made: 


i  j 


A; 


^min^ 


mmAmoi 


\3 

-  ^ 


^mtn^ 


^  ~t"  ^mai  V-™'  r2 

-  \  a  Z^r? 


AmmAmax 


(because  A  min  ^  Aj,  ^  Amax) 


Amin  Aw 


^mm/'max 


If  we  define  <t>  =  J2j  A],  we  have 


j  j  Amin  /'n 


'mtn/'max 


Basic  calculus  shows  that  this  expression  is  maximized  when  <f>  =  =j(Amtn  +  Amu)  Ej  (j  Aj,  so  we  have 


Amin^mai 


It  follows  that 


w2  <  1  - 


4Amin  A 


mmAmai 


W  < 


(Amin  "l"  Amaa:)2 
(Amin  ~  A max)^ 
(Amin  "f-  Amai)2 
K  —  1 


-  k  +  r 


where  k  =  4®“-.  Thus,  Equation  27  is  true  for  any  n  >  l. 


C4.  Optimality  of  Chebyshev  Polynomials 

Chebyshev  polynomials  are  optimal  for  minimization  of  expressions  like  Expression  49  because  they 
increase  in  magnitude  more  quickly  outside  the  range  [—1,1]  than  any  other  polynomial  that  is  restricted  to 
have  magnitude  no  greater  than  one  inside  the  range  [-1,1]. 

The  Chebyshev  polynomial  of  degree  t, 

Ti(u>)  ii[(u+  n/u,2  +  1)’  +  (u  -  s/u2  -  1)‘]  , 
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can  also  be  expressed  on  the  region  [  - 1 , 1]  as 


Ti(uj)  =  cos(icos-1  u>),  -1  <  u  <  l. 

From  this  expression  (and  from  Figure  31),  one  can  deduce  that  the  Chebyshev  polynomials  have  the 
property 

|Ti(w)|  <  1,  -1<«<1 

and  oscillate  rapidly  between  - 1  and  1: 

T>  (cos  (t))  =  (-1)*’  k  =  °,1 . *• 

Notice  that  the  i  zeros  of  T,  must  fall  between  the  i  +  1  extrema  of  T,  in  the  range  [-1,1].  For  an  example, 
see  the  five  zeros  of  Ts(u>)  in  Figure  31. 

Similarly,  the  function 

rp.  |  Amai+Amtn-2A \ 

_  1  \  Amal  —  Amin  J 

Tt  (  Arnai-t-Am.n  \ 

oscillates  in  the  range  ±T<( on  the  domain  [Amm,  Amax],  /»■(  A)  also  satisfies  the  requirement 
that  Pi{ 0)  =  1. 

The  proof  relies  on  a  cute  trick.  There  is  no  polynomial  Ql(X)  of  degree  i  such  that  <3,(0)  =  1  and  Qt 
is  better  than  Pt  on  the  range  [Amin,  Amaa;].  To  prove  this,  suppose  that  there  is  such  a  polynomial;  then, 
<  r'(ww)''  on  the  [A  mini  Amaa.  ].  It  follows  that  the  polynomial  P,  -  Qx  has  a  zero 

at  A  =  0,  and  also  has  i  zeros  on  the  range  [Amin,  Amar],  Hence,  Pt  —  Q,  is  a  polynomial  of  degree  i  with 
at  least  i  +  1  zeros,  which  is  impossible.  By  contradiction,  I  conclude  that  the  Chebyshev  polynomial  of 
degree  i  optimizes  Expression  49. 
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